Elixir RAM and the Template of Doom

April 26, 2016

I will attempt to convince you, in two lines of code, that Elixir is more interesting than any programming language you’ve ever used.

Are you ready? Don’t worry, the code doesn’t involve quicksort, or metaprogramming, or anything like that.

Here we go.

{:ok, file} = :file.open("/tmp/something.txt", [:write, :raw]).

:file.write(file, :re.replace("Hello & Goodbye", "&", "&amp;")).

The code itself is nothing special. It opens a file and writes a short string to it, replacing ampersands in the (hard-coded) string with the HTML entity &. You’re probably sharp enough to write the equivalent code in your favorite language in a couple of minutes, or less.

But your code wouldn’t be entirely equivalent to the Elixir code, at least, not from the computer’s perspective. If you run a tracing program like strace or dtruss on your non-Elixir code, you’ll probably see something like this:

write(0x3, "Hello & goodbye\0", 0x13)

Just the usual syscall we’d expect. But if you trace the Elixir program, you’ll instead see something like this:

writev(0x1A, 0x1A5405F8, 0x4)

What gives? What is that strange hexadecimal number, and more to the point, why isn’t good old write(2) good enough for Mr. Fancy Pants Elixir?

The answer, while not immediately obvious from the code sample, goes a long way towards explaining Elixir’s unique performance characteristics — as well as some anomalies you might encounter if you ever try benchmarking Erlang or Elixir HTML templates. Read on for a tale of technical subtlety and engineering that culminates in the in-memory rendering of the Template of Doom, a 30 gigabyte monster from the bottom of the sea. By the end of this post, I promise you won’t look at any web server the same way.

Anyway, enough gabbing from me… let’s dive in!

Gather ’round

To get a handle on things, let’s first revisit that string-replacement code:

:re.replace("Hello & Goodbye", "&", "&amp;")

If you paste that code into an Elixir shell, you’ll see something slightly unexpected:

["Hello ", ["&", "amp;"] | " Goodbye"]

Instead of a flat string, Erlang — oops, did I say Erlang? I meant Elixir — creates a nested list containing four leaf elements that add up to the string we expect: "Hello ", "&", "amp;", and " Goodbye". At first glance this may seem like a pointless complication, but let’s see what the computer thinks about the situation.

If you look at the man page for writev, you’ll see it’s a “gathered write”, which means it writes data from multiple memory locations in a single system call. I wrote a little DTrace script to unpack the writev call we saw earlier, and peek at what that Elixir code is actually doing with this system call. Here’s the log of that script:

  writev:return Writev data 1/4: (6 bytes): 0x000000001a3c0d78 Hello 
  writev:return Writev data 2/4: (1 bytes): 0x000000001a3c0d7e &
  writev:return Writev data 3/4: (4 bytes): 0x000000001a3c0b49 amp;
  writev:return Writev data 4/4: (8 bytes): 0x000000001a3c0d7f  Goodbye

The original hexadecimal number — in the writev call in the introduction — was the memory address of a vector. That vector contains the memory address of four other memory addresses, represented above as big hexadecimal numbers next to the strings at those addresses.

You can see from this log that Elixir is writing the elements of the nested list separately: "Hello ", "&", "amp;", and " Goodbye". But do you notice anything peculiar about the memory locations? Does one of them, perhaps, not look like the others?

0x000000001a3c0d78 Hello 
0x000000001a3c0d7e &
0x000000001a3c0b49 amp;
0x000000001a3c0d7f  Goodbye

I’m not very good with hexadecimal, so let’s lay out all the characters at their stated memory addresses. I am simply rearranging and expanding the data in the log above — bold-facing the start addresses for clarity.

0x0b49 'a'
0x0b4a 'm'
0x0b4b 'p'
0x0b4c ';'

...

0x0d78 'H'
0x0d79 'e'
0x0d7a 'l'
0x0d7b 'l'
0x0d7c 'o'
0x0d7d ' '
0x0d7e '&'
0x0d7f ' '
0x0d80 'G'
0x0d81 'o'
0x0d82 'o'
0x0d83 'o'
0x0d84 'd'
0x0d85 'b'
0x0d86 'y'
0x0d87 'e'

Do you see it yet? That nested list of string fragments — the list that makes up the new string — starts to make sense now. It’s just three pointers into the original string, plus an out-of-order pointer to the replacement string. In other words, there is no “new string” — there is just a set of modifications of the old string. (You can also see an extra tiny optimization performed by the regex engine — notice that the final string uses the ampersand from the original string, rather than from the replacement string.)

The data structure is called an I/O list. It’s designed to leverage writev, and thus to minimize data copies when writing to disk or to the network. Most languages wipe out the original string and copy the whole thing over, but they’re missing out on a whole class of no-copy (and late-copy) performance optimizations.

Of course, pointers aren’t a panacea. Sometimes data copying is cheaper than pointer juggling. Let’s use DTrace to explore the Erlang VM implementation, and see where the system draws various lines in its attempt to balance engineering considerations.

Tracing the limits

Try tracing these two lines of code:

:file.write(file, Enum.map(1..14, fn(_) -> "Foobar" end))

:file.write(file, Enum.map(1..15, fn(_) -> "Foobar" end))

Both attempt to write an I/O list to a file — the first one is "Foobar" repeated 14 times, the second is "Foobar" repeated 15 times. You might think these tasks will result in the same syscalls, but in reality they don’t. The first line of code calls writev with a vector of 14 elements, but the second one flattens the I/O list into a single blob of memory — then calls write on the concatenated data, much as other languages will.

If you try strings slightly longer or shorter than "Foobar", you’ll see that the trigger here is simply the number of elements in the list, rather than the total size of the resulting string. If you hit 15 elements in the list, Erlang will switch over to concatenating data into a contiguous chunk of memory. If you poke around the VM code you’ll see this is triggered by the constant SMALL_WRITE_VEC.

The DTrace log for the 14-element vector reveals something else with writev, something that might surprise you. Take a look:

 writev:return Writev data 1/14: (6 bytes): 0x00000000153807f8 Foobar
 writev:return Writev data 2/14: (6 bytes): 0x0000000015380528 Foobar
 writev:return Writev data 3/14: (6 bytes): 0x0000000015380568 Foobar
 writev:return Writev data 4/14: (6 bytes): 0x00000000153801b0 Foobar
 writev:return Writev data 5/14: (6 bytes): 0x00000000153801f0 Foobar
 writev:return Writev data 6/14: (6 bytes): 0x0000000015383218 Foobar
 writev:return Writev data 7/14: (6 bytes): 0x0000000015382138 Foobar
 writev:return Writev data 8/14: (6 bytes): 0x0000000015382178 Foobar
 writev:return Writev data 9/14: (6 bytes): 0x0000000015383258 Foobar
 writev:return Writev data 10/14: (6 bytes): 0x0000000015383298 Foobar
 writev:return Writev data 11/14: (6 bytes): 0x00000000153832d8 Foobar
 writev:return Writev data 12/14: (6 bytes): 0x0000000015383318 Foobar
 writev:return Writev data 13/14: (6 bytes): 0x0000000015383358 Foobar
 writev:return Writev data 14/14: (6 bytes): 0x0000000015380230 Foobar

Notice anything about the memory addresses? Well, don’t strain your eyes over it — I just mean to point out that they’re all different addresses. That is, there are 14 copies of "Foobar" lying around in memory.

Interestingly, that’s not the case if you move the "Foobar" literal outside of the closure in the Elixir code. If you trace this code instead:

 foobar = "Foobar"
 :file.write(file, Enum.map(1..14, fn(_) -> foobar end))

You’ll see the same end result, but get a rather different trace:

 writev:return Writev data 1/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 2/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 3/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 4/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 5/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 6/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 7/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 8/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 9/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 10/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 11/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 12/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 13/14: (6 bytes): 0x00000000153803b8 Foobar
 writev:return Writev data 14/14: (6 bytes): 0x00000000153803b8 Foobar

This time, the memory addresses are the same! How does that work?

Under the hood, small strings in Elixir are instantiated wherever the literal appears in code. So when "Foobar" is inside the closure, a new string is created whenever the closure is executed. When "Foobar" is outside the closure, each execution of the closure results in a new reference — to the same string.

If you’re coming from (say) JavaScript, this point might seem obvious — but remember, strings in Elixir are immutable, so we wouldn’t necessarily expect a new instantiation for each execution of the code.

In fact, these rules change for “big” strings, defined as 64 bytes or larger. (That’s the second magic number — ERL_ONHEAP_BIN_LIMIT in the VM code.) If big literals appear in a compiled module, they’re allocated on a shared heap and then ref-counted. So you’ll see the same memory address in the trace, regardless of where the literal appears in the code. (Worth noting: string literals in the Elixir shell always use the small-string rules. It’s… complicated.)

You might be thinking: that’s great and all — but who cares about "Foobar"? What’s any of this got to do with me?

Less is more (costly)

I am about to show you a benchmark that is going to blow your mind — and make you question everything you thought you knew about web server performance. (Well, everything except the great stuff you’ve learned in this article so far.) Are you ready? There’s no going back. This is something that, like that scene in Raiders of the Lost Ark where one guy’s face melts off and the other guy’s head explodes, you simply cannot un-see.

Here are two Elixir templates, wired up in the excellent Phoenix web framework.

Template the first:

<%= for a <- 1..@iters do %>
BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah
BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah
BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah

BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah
BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah
BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah
<% end %>

Template the second:

<%= for a <- 1..@iters do %>
BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah
BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah
BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah
<%= "" %>
BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah
BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah
BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah BlahBlahBlahBlah
<% end %>

Each template produces identical output — a lot of BlahBlahBlahBlah. The only difference is that the second template inserts a string of length zero in the middle of each iteration.

Now, you’re not going to believe me, but one of these templates has 40% greater throughput than the other. But before you get excited about caching computed template fragments, you’d better have a close look at the benchmark output.

Benchmark the first (the template without the empty string):

Running 15s test @ http://127.0.0.1:4001/index2?iters=1000
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   352.28ms   80.35ms 609.15ms   72.32%
    Req/Sec    88.92     34.12   380.00     74.19%
  15873 requests in 15.10s, 6.07GB read
  Socket errors: connect 0, read 273, write 0, timeout 0
Requests/sec:   1051.00
Transfer/sec:    411.32MB

Benchmark the second (the template with the empty string):

Running 15s test @ http://127.0.0.1:4001/index?iters=1000
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   249.31ms   49.51ms 408.41ms   82.16%
    Req/Sec   125.34     35.84   320.00     77.01%
  22451 requests in 15.10s, 8.58GB read
  Socket errors: connect 0, read 333, write 0, timeout 0
Requests/sec:   1486.72
Transfer/sec:    581.79MB

Ruh-roh. Inserting an empty string into the template increases performance by 40%! How could that be?

If you’ve learned anything so far, it’s that syscalls matter — and here, strace and DTrace are your friends. Let’s use them to look at these templates from the computer’s perspective.

The first template produces (a lot of) output like this:

 writev:return Writev data 10/11: (410 bytes): 0x000000001fc7dd28 \nBlahBlahBlahBlah

The second template produces output like this:

 writev:return Writev data 2/2: (410333 bytes): 0x0000000026d3d248 HTTP/1.1 200 OK ...

That is, the first template is delivered to the client in tiny 410 byte chunks — but when the empty string is included, the second template (along with the HTTP response headers) is flattened into one big 410 kilobyte string instead. Verrrrrrrry interesting. Tell me you’re not at least a little curious about what’s going on.

The discrepancy can be traced to another compile-time constant in the Erlang VM — this one is known as ERL_SMALL_IO_BIN_LIMIT, and it determines the maximum string size that will be consolidated into one big string. It’s hard-coded to be four times the value of ERL_ONHEAP_BIN_LIMIT, or 256 bytes to be exact. Strings smaller than (or equal to) that limit that will be consolidated; larger strings will be preserved in the writev vector.

You should notice that this template example was deliberately contrived — the 410 byte chunks of BlahBlahBlahBlah exceed the limit, so they get their own entry in the writev vector. But if you split the chunks in half — the true and nefarious purpose of that mysterious empty string — the resulting 205-byte chunks are below the 256-byte limit, so they get consolidated. As it happens, on my computer, consolidation has 40% greater throughput for this particular template, but comes at the cost of higher RAM usage.

Which brings me to…

The Template of Doom

We now have enough information to render THE TEMPLATE OF DOOM, which was recently recovered from a sunken Spanish galleon:

<%= for a <- 1..100_000_000 %>
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
DOOM DOOM DOOM DOOM
<% end %>

That template will produce a file with the word DOOM repeated 6 billion times, totaling about 30 GB in size. Because the template consists entirely of chunks larger than 256 bytes, and the chunks are compiled to refer to the same memory address, we know the server should “only” be storing 100 million boxed pointers — order of magnitude 1 GB. (On my machine the VM topped out at about 5 GB of RAM usage while serving up a single copy of THE TEMPLATE OF DOOM.)

The best part is that you can make THE TEMPLATE OF DOOM crash the server just by inserting an empty string in the template. I dare you to try it.

Or rather, that’s only the second-best part. The best part is that you now understand why it crashes.

Now, before you email me about how you implemented THE TEMPLATE OF DOOM in assembly using only five molecules of RAM, remember these are just silly benchmarks on my underpowered MacBook — the point of this exercise isn’t “insert empty strings into your templates for speed” or “repeat the same string 100 million times in order to save memory.” It’s to help give you the tools to understand what’s going on under the hood of your system. Some engineering situations will call for small memory footprints, where the tiny-chunk writev strategy may be better; others will have great CPU caching, so the slab-of-RAM may be fine. These exercises should just be a starting point for your own explorations.

Parting wisdom

Elixir’s memory architecture fits neatly onto the problem of dynamic template rendering — if rendering speeds are killing your productivity and making your customers swim in the black pools of their own page-load depression, take a look at Phoenix or Chicago Boss. Because these frameworks work with writev and I/O lists (as well as kqueue/epoll), they slash CPU and RAM usage compared to non-Erlang systems, and in many cases obviate the need for server-side caching.¹

But the Erlang VM, clever as it is, does produce some oddities. From my highly informal testing, it looks like ERL_SMALL_IO_BIN_LIMIT in particular should be cranked up a bit, perhaps to 512 bytes. The guardians of the VM might consider making it a tunable compile-time parameter.

If you’re not using Erlang or Elixir, you might want to poke around your favorite language’s template implementations and see if they’re using writev under the hood. A tracer makes quick work of this task — strace on Linux and dtruss on OS X are a good start, but absolutely nothing beats DTrace for power and versatility.² If you don’t see writev in the trace logs, there might be some opportunity for improving the system’s memory architecture. Keep an eye on those memory locations in the trace output — a little string re-use can go a long way.

Finally, while I enjoyed using Elixir for these code examples, I found that Elixir’s libraries could make better use of I/O lists, particularly in the Regex and String modules, as well as with HTML entity-escaping. Without I/O lists, Elixir is just eager-copying memory like everybody else, and missing out on one of the features that makes the Erlang VM so technically interesting.

Notes

Even if a particular I/O list is ultimately flattened before being sent to the client — as with the second template in the examples — there’s a significant memory benefit to building the page as an I/O list, then flattening it all at once. Run a tracer to see how the Erlang VM is able to re-use I/O scratch space for better performance. ⇧
Spend an afternoon learning DTrace’s D script language. It’s an amazing tool. Here’s the little script I used for this article. ⇧

You’re reading evanmiller.org, a random collection of math, tech, and musings. If you liked this you might also enjoy:

Why I Program in Erlang
Type Punning Functions in C
A Taste of Rust
Four Days of Go
A Review of Perl 6 (Raku)

Get new articles as they’re published, via LinkedIn, Twitter, or RSS.

Want to look for statistical patterns in your MySQL, PostgreSQL, or SQLite database? My desktop statistics software Wizard can help you analyze more data in less time and communicate discoveries visually without spending days struggling with pointless command syntax. Check it out!

Wizard
Statistics the Mac way

Back to Evan Miller’s home page – Subscribe to RSS – LinkedIn – Twitter