A Virtual Canoe, An Interactive Poem, A Barking Dog

This is a virtual canoe. Take a seat, and use your forward arrow key to move downriver at your chosen pace. You’ll sail past a forest, through a cave, and into a marshland, using only your sense of hearing to guide yourself. For tutorial purposes, we’ve placed oblong rectangles as a visual aid, to show you explicitly where the virtual sound sources are.


Be sure to listen with headphones. If you do not hear anything after the program has loaded, click within the 3D world to initialize it.


This experiment came out of a series of discussions about how to use audio imaging (the use of sound sources placed around a listener in 3D space) in order to tell stories in new and interesting ways. Through this technique, we’ve also approached the idea of creating “audio poetry” that utilizes our innate ability to create mental maps of spaces through sound alone. What we’ve built isn’t so much a poem. It’s more a simulation of a virtual 3D space. The point of this exercise was to figure out the poetic language intrinsic to this medium, and the qualities of the medium that lend themselves best to poetic expression.

We’ve done a lot of thinking about what poetry is and how we can use these new tools to play with it. We’ve decided to view poetry less as a medium and more as a method of expression. A poem, as we’ve come to associate it, exists through language, be that spoken or written. Of course, not all uses of language are terribly poetic: instruction manuals, encyclopedias, and textbooks would all make pretty poor text choices at your next soirée. There’s a reason you don’t write an instruction manual in verse: poetry involves getting tangled up in lyrical syntax, cultural connotations, and musical gestures embedded in the text. Language is often used for utility, but poetry’s interests are elsewhere.

Here’s a little analogy I like. There are an awful lot of YouTube channels that I tend to watch exclusively at double speed—things like software tutorials or slow news broadcasts. You wouldn’t think of watching a short film at double speed—that’s not the point of film. This isn’t to rag on tutorial videos, it’s simply to suggest that pacing isn’t intrinsic to the purpose of their creation. The point of a tutorial is to teach you where the damn “loop” button is, not exactly to evoke an emotional response. Thus, there’s a bit of a divide between what we’re referring to as “poetry” and “utility.” In fact, they exist on a spectrum! One isn’t better than the other—it’s all to do with the nature of what is being communicated.

Your sound wave x(t) and its associated transfer functions.

On a technical level, this audio tour’s imaging works through something called a head-related transfer function (HRTF). Let’s say you’re walking down the street when you hear an incredibly loud car crash on your left. Your left ear will sense the initial sound wave a fraction of a millisecond before your right ear picks it up. Likewise, the high-frequency shattering of glass will be heard loud and clear in the left ear, but seem a little bit muffled on the right. The sound wave has to bend around your head, and all the gunk in it, before it gets to your right ear, which dulls the higher frequencies. An HRTF is a little bit of math that lets us take a sound wave, specify where in 3D space relative to the listener the sound will be coming from, and output a stereo file that mimics how a human head physically hears sound. Your body will almost involuntarily turn to the left in order to face a virtual car crash, just as it would in real life.
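To make the two cues concrete, here is a minimal Python sketch. It is not the HRTF used in the tour, and not a real HRTF at all: it is a crude stand-in that assumes a spherical head (Woodworth's classic approximation for the arrival-time gap) and fakes the muffling with a simple one-pole low-pass filter. The head radius, the speed of sound, and the `max_shadow` amount are all assumed values, and the function names are made up for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air (assumed, roughly 20 °C)
HEAD_RADIUS = 0.0875    # metres (assumed textbook average, not measured)


def interaural_delay(azimuth_deg):
    """Seconds by which the far ear lags the near ear (Woodworth's model).

    0 degrees is straight ahead; +/-90 is directly to one side.
    Angles behind the listener are clamped to 90 for simplicity.
    """
    theta = math.radians(min(abs(azimuth_deg), 90.0))
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))


def crude_binaural(mono, azimuth_deg, sample_rate=44100, max_shadow=0.6):
    """Split a mono sample list into (left, right) for a source at azimuth_deg.

    Negative azimuth means the source is on the left. The far ear's copy
    is delayed and run through a one-pole low-pass filter to dull its highs.
    """
    theta = math.radians(min(abs(azimuth_deg), 90.0))
    delay = round(interaural_delay(azimuth_deg) * sample_rate)
    shadow = max_shadow * math.sin(theta)  # no muffling when dead ahead

    near = list(mono) + [0.0] * delay      # padded so both ears match in length
    far, previous = [], 0.0
    for sample in [0.0] * delay + list(mono):
        previous = shadow * previous + (1.0 - shadow) * sample
        far.append(previous)

    return (near, far) if azimuth_deg < 0 else (far, near)


# A single click from hard left: the right ear hears it later and duller.
left, right = crude_binaural([1.0, 0.0, 0.0, 0.0], azimuth_deg=-90)
```

Under these assumptions, a source directly to one side reaches the far ear roughly two thirds of a millisecond late. A measured HRTF captures far more than this (the shape of your outer ear, elevation, reflections off your shoulders), which is why the real thing sounds so much more convincing than a delay and a filter.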

That being said, you’ll notice that there are a few areas in the above tour where the illusion is less than convincing. As it turns out, good spatial sound design has nothing to do with the quality of your HRTF model, and everything to do with how your scenario is set up. The creation of this 3D world was a pretty interesting lesson for us, one that taught us a few things about how to better design these spaces to be most effective.

We decided to compile a few of our learnings:


lesson one: the overall “realism” of an image is less important than the dynamism of the image.

Close your eyes and take a walk down the street (just be careful). Chances are you’ll be able to hear changes in the audio image as you walk, but even more amazing is the fact that you’ll probably be able to tell which objects are stationary and which are not. Your brain is pretty good at distinguishing which sounds are “in motion” towards or away from you, and which sounds only appear to be moving because of your own walking velocity. Much of the reason you can tell where objects are as you move is that your body has a nuanced sense of its own change in position. But if we were to, somehow, perfectly record how your brain “heard” that blind walk down the street and play it back to someone sitting in a darkened room, chances are they wouldn’t have nearly as easy a time distinguishing different types of relative motion. Is that dog’s barking getting louder because you’re walking towards it, or because it’s moving towards you? This kind of discrimination is never an issue in our day-to-day lives. However, when you’re designing a virtual river to be paddled down, it immediately becomes an obvious problem.

Frankly, it doesn’t matter at all how believable your HRTF model is. What matters is that the dynamics of motion are clearly and explicitly conveyed. For example, our initial “scene” of the marshland only included a background ambient sound and a single barking dog. Even though we were using a professional-grade HRTF, the barking sounded confusing to both of us. Simply put, the single barking dog didn’t give us nearly enough information to grasp where we were relative to the dog, whether the dog was moving or we were, and how fast we were moving. However, once multiple sounds were added to the scene, we could easily tell exactly where everything was and paint a mental picture of our velocity. Our conclusion is that a larger number of highly directional sounds using a poor HRTF is actually much better at communicating motion than a small number of sounds with a great HRTF. The addition of more sounds gives your brain more discrete data points to latch onto, helping stress the characteristics of your motion and the space around you.
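The barking-dog ambiguity can be shown with a toy model. The Python sketch below is not how the tour computes anything: it assumes a simple inverse-distance loudness rule, and every coordinate and name in it is hypothetical. With the dog dead ahead, "we paddle toward the dog" and "the dog walks toward us" produce exactly the same loudness curve. Add one stationary source off to the side and the two scenarios stop matching.

```python
import math


def gain(listener, source, reference=1.0):
    """Inverse-distance loudness: 1.0 at the reference distance, quieter beyond."""
    return reference / max(math.dist(listener, source), reference)


def loudness_trace(listener_path, source_paths):
    """For each step, list every source's gain as heard from the listener."""
    return [
        [round(gain(listener, path[step]), 3) for path in source_paths]
        for step, listener in enumerate(listener_path)
    ]


def still(point, steps=3):
    """A path that stays at one point for the given number of steps."""
    return [point] * steps


# Hypothetical coordinates in metres; the river runs along the y axis.
# Scenario A: we paddle toward a dog that stays put.
we_move = [(0, 0), (0, 1), (0, 2)]
dog_waits = still((0, 10))

# Scenario B: we sit still while the dog walks toward us.
we_wait = still((0, 0))
dog_moves = [(0, 10), (0, 9), (0, 8)]

# A second source that never moves, off to one side.
frogs = still((3, 0))

dog_only_a = loudness_trace(we_move, [dog_waits])
dog_only_b = loudness_trace(we_wait, [dog_moves])
with_frogs_a = loudness_trace(we_move, [dog_waits, frogs])
with_frogs_b = loudness_trace(we_wait, [dog_moves, frogs])

print(dog_only_a == dog_only_b)      # True: the dog alone cannot tell A from B
print(with_frogs_a == with_frogs_b)  # False: the frogs give the game away
```

In scenario A the frogs fade as we leave them behind; in scenario B they hold steady. That extra, independent data point is what lets a listener work out who is actually moving.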

lesson two: your mechanics dictate the sounds you can (believably) use.

In drafting the cave sequence, I thought it would be amusing to have the user paddle past a gigantic hibernating bear. In testing we found this was a terrible idea (but not for the reasons you might think!). For a variety of very pragmatic evolutionary reasons, it turns out your ears are superb at picking out where sleeping bears are relative to you. So much so that hearing a bear recording creep into our right ears resulted in everyone desperately looking for some sort of “head turn” mechanic. Adding a bear to the scene turns a quiet cruise down a river into a nightmare for your lizard brain searching for danger. The solution to this is obvious: let the user turn their head. However, that wasn’t really the point of the above experiment, as we chose to work with one simple mechanic. While this would be a great thing to know if we were designing an action video game or some sort of binaural home security system (a possibility, I suppose?), it’s not terribly useful for creating audio poetry in the “genre” of the above simulation. Everything magical about poetry, all those little abstractions from utilitarian communication, immediately vanishes when you think you might be lunch. Two words: avoid bears.

lesson three: if a setting is realistic, a transition out of that setting needs to be hyperrealistic.

Reverberation is an amazing thing. Change how long a sound reverberates for, and you’ll change the type of space people will believe they’re in. Reverb also lets us mentally map the rough dimensions, the volume, and the materials intrinsic to a space. The reverb of a sound is, in a sense, a fingerprint of its environment: a coin dropping onto concrete in a cathedral will sound different from the same coin dropping onto the concrete floor of your garage. Even more subtly, you’ll probably be able to distinguish a twenty-foot garage from a thirty-foot one based on coin drop recordings alone. Reverb, like imaging, is deeply programmed into how your ears model the world. When we first put the canoe journey together, I modeled the forest’s reverb on an actual forest, and the cave’s on the inside of an actual cave (shout out to the rangers of Kentucky’s Mammoth Cave, who didn’t seem to bat an eye when I recorded an impulse response there). Much to my surprise, not only did listeners not hear the difference in reverb, but they didn’t even understand that the setting had changed from a densely foliated exterior to a hard-walled, narrow interior. As it turns out, humans are much worse at modeling the reverb of spaces when they’re not actually in that space. I’m still quite astonished by this finding.

The conclusion we came to is that realism has nothing to do with producing a convincing spatial soundscape. In order to transition from forest to cave believably, reverberation times had to increase by an amount I can only refer to as “wacko.” The change in reverb demonstrated in this version of the audio tour would never occur in the real world. But keep in mind we have a lot to compensate for. Canoe into a cave and you’ll be host to a wide variety of sensory input, from the feeling of increased moisture in the air to the change in temperature on your skin. As a result, the pursuit of accurate soundscape models for the purpose of believability is flawed. A perfect replication of how sounds in the natural world interact is ultimately a poor tool for communication when the other senses are absent. Instead, good storytelling through spatial soundscapes must use audio’s full range in order to draw the user’s attention to qualities of audio that aren’t terribly literal. In other words, it’s the creation of poetry.
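For the curious, here is roughly what “changing the reverb time” means in practice, as a small Python sketch. It does not use the recorded impulse responses from the tour. Instead it assumes a synthetic impulse response (decaying noise that drops 60 dB over a chosen time, usually called RT60) and applies it by plain convolution. The three reverb times are hypothetical numbers picked only to show the idea of a realistic jump versus a “wacko” one.

```python
import random

SAMPLE_RATE = 8000  # samples per second; kept low so pure Python stays quick


def synthetic_impulse_response(rt60, sample_rate=SAMPLE_RATE, seed=0):
    """Noise whose envelope falls by 60 dB (a factor of 1000) over rt60 seconds."""
    rng = random.Random(seed)
    length = round(rt60 * sample_rate)
    return [
        rng.uniform(-1.0, 1.0) * 10 ** (-3.0 * i / length)
        for i in range(length)
    ]


def convolve(dry, impulse_response):
    """Apply a reverb by direct convolution of the dry signal with the response."""
    wet = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, x in enumerate(dry):
        if x == 0.0:
            continue  # silent samples add nothing, so skip the inner loop
        for j, h in enumerate(impulse_response):
            wet[i + j] += x * h
    return wet


# Hypothetical reverb times in seconds, not measurements from the tour.
forest_ir = synthetic_impulse_response(rt60=0.3)
realistic_cave_ir = synthetic_impulse_response(rt60=2.0)
wacko_cave_ir = synthetic_impulse_response(rt60=6.0)

paddle_splash = [1.0]  # a single click standing in for a dry recording
for name, ir in [("forest", forest_ir),
                 ("realistic cave", realistic_cave_ir),
                 ("wacko cave", wacko_cave_ir)]:
    wet = convolve(paddle_splash, ir)
    print(f"{name}: tail lasts {len(wet) / SAMPLE_RATE:.1f} s")
```

A recorded impulse response would slot into `convolve` the same way. The design choice the lesson points to is simply which reverb time you hand the listener: the one the space actually has, or the one that makes the transition impossible to miss.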


Written poetry isn’t utilitarian communication for precisely this reason: it’s the codification of language to communicate sight, touch, smell, taste, and all your other senses without making use of any of them. It compensates for this fact by drawing attention to the nonliteral qualities of language, communicating any variety of things through a shared cultural knowledge of how language works, how words look, and how the vocalization of a phrase sounds. If a poem relies too heavily on literal descriptions, it becomes banal. In fact, I’d extend that to poetry in any medium: a movie that puts all its expression into showy visual effects feels just as empty as a virtual canoe trip too focused on acoustic accuracy.

A model of a world poetically conveyed through spatial audio would look nothing like the world you and I live in. All poetry is kind of a model too. A poem is an experience that’s been “translated” into some sort of communication medium—be that a painting, dance performance, or website—and spat back out for someone else. This is because the real world itself isn’t poetry. I admire the scent of the ocean and the color of flowers as much as the next guy, but they only become poetry once they’re broken down into some sort of interpretable language, translated through a media machine, and spat back out in terms that humans can still relate to. Being in a field of flowers can surely be poetic, but it’s only what you make of it that becomes the poem.

An experience can be poetic, but it can’t be a poem.

A poem is an artifact.