Abstract

With the release of Apple's AirPods 3, spatial audio has become a hot topic and many people want to try it. However, the devices that support spatial audio are expensive, so implementing spatial audio with inexpensive hardware is useful. In this project, the gyroscopes inside smartphones and the front cameras of phones and computers are used to support spatial audio. The webpage demonstrating this idea can be found here.

Introduction

Spatial audio

Why do we need spatial audio? In early theaters, the sound system had only one channel. Then one day Blumlein visited the cinema with his wife and found the single channel strange, because it sounded as if all the voices came from one place.

Then came the stereo signal. Stereo panning is the simplest implementation of spatial audio, and for a long time users were satisfied with it. But as technology developed, it became clear that two channels could not fulfill the pursuit of truly spatial audio: sound that comes from all directions.

Then came surround sound, which uses seven speakers: front, front left, front right, left, right, back left, and back right. This system works remarkably well and is still used in many cinemas today.

But a system with many speakers can be expensive, and not everyone has the space to set it up. This is where computed spatial audio comes in.

This project uses the Head-Related Transfer Function (HRTF), one of the best-known implementations of computed spatial audio.

The human head acts as a lowpass filter and produces significant attenuation (head shadowing). When a person turns his or her head left and right, the sound reaching each ear changes, and this is how people can tell where a sound comes from.

With an effective head radius \(a\), speed of sound \(c\), sampling rate \(f_s\), azimuth \(\theta\), and zero position \(\alpha(\theta)\), the head-shadow HRTF can be calculated as below. \[ \begin{array}{l}\omega_{0}=c / a \\ \alpha(\theta)=1.05+0.95 \cos \left(180^{\circ} \cdot \theta / 150^{\circ}\right) \\ \text{HRTF}=\frac{\left(\omega_{0}+\alpha f_{s}\right)+\left(\omega_{0}-\alpha f_{s}\right) z^{-1}}{\left(\omega_{0}+f_{s}\right)+\left(\omega_{0}-f_{s}\right) z^{-1}}\end{array} \] The interaural time delay, which can be realized as a first-order allpass filter, is modeled as below. \[ \tau_{h}(\theta)=\left\{\begin{array}{cc}-a(\cos \theta) / c & \text { if } 0 \leq|\theta|<\pi / 2 \\ a(|\theta|-\pi / 2) / c & \text { if } \pi / 2 \leq|\theta|<\pi\end{array}\right. \] Because a person's two ears are at different positions, these formulas allow the audio signal for each ear to be calculated separately.
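As an illustration, the filter coefficients and the interaural delay above can be computed as in the following JavaScript sketch. The head radius (8.75 cm) is an assumed typical value, and these functions are illustrative, not code from the project itself.

```javascript
const a = 0.0875; // assumed effective head radius in meters
const c = 343;    // speed of sound in m/s

// Zero position alpha(theta); theta is in radians, and the
// 180/150 degree ratio from the formula above reduces to 1.2.
function alpha(theta) {
  return 1.05 + 0.95 * Math.cos(1.2 * theta);
}

// First-order IIR coefficients of the head-shadow filter at sampling
// rate fs, for the difference equation
// y[n] = (b0 * x[n] + b1 * x[n-1] - a1 * y[n-1]) / a0.
function headShadowCoeffs(theta, fs) {
  const w0 = c / a;
  const al = alpha(theta);
  return { b0: w0 + al * fs, b1: w0 - al * fs, a0: w0 + fs, a1: w0 - fs };
}

// Interaural time delay tau_h(theta) in seconds (piecewise model above).
function headDelay(theta) {
  const t = Math.abs(theta);
  return t < Math.PI / 2 ? -a * Math.cos(t) / c
                         : a * (t - Math.PI / 2) / c;
}
```

Evaluating both functions once per ear, with the azimuth measured relative to that ear, yields the per-ear filter and delay that produce the binaural signal.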

Apple’s implementation

As the following image shows, there is an Inertial Measurement Unit (IMU) in each AirPods Pro earbud, so the user's head orientation can be computed in real time. With these head-orientation signals, Apple Music can render spatial audio in real time: when users turn their heads left and right, they can hear sounds coming from different directions.

airpods_IMUs

Spatial audio actually predates Apple's AirPods Pro: Sony's PS4 and PS5 both support spatial audio with their specially designed headphones. However, all of these devices are expensive, and not everyone has them (I have neither). So I implemented spatial audio using the gyroscope inside a smartphone or the camera of a desktop computer.

Implementations

IMU-based Method

Below is the diagram for IMU-based spatial audio.

Gyro-based

In this web app, the user's head orientation is measured with the gyroscope inside the smartphone. Users should keep the phone pointing in the same direction as their head, and the audio is then generated in real time to mimic spatial audio, as the sketch below shows.
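A minimal sketch of reading the phone's orientation in the browser is shown below; updateSpatialAudio() is a hypothetical hook standing in for the project's rendering code, and on iOS 13+ the page must first call DeviceOrientationEvent.requestPermission() from a user gesture.

```javascript
window.addEventListener('deviceorientation', (event) => {
  const yaw = event.alpha;   // rotation around the vertical axis, 0-360 degrees
  const pitch = event.beta;  // front-to-back tilt, -180 to 180 degrees
  const roll = event.gamma;  // left-to-right tilt, -90 to 90 degrees
  updateSpatialAudio(yaw, pitch, roll); // hypothetical rendering hook
});
```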

We recommend tying the phone on top of your head, with the screen facing up and the top of the phone pointing toward the tip of your nose. Although strapping a phone to your head is not convenient, this approach works for almost everyone!

This web app can be found here: https://spatialaudio.omniai.org/spatial_audio_for_phones.html

Camera-based Method

However, since wearing a mobile phone on top of your head is rather silly, I decided to make a new version using the camera, as shown below.

Camera-based

In this solution, a TensorFlow model is used to generate a face mesh, from which the direction of the head is calculated in real time. Two vectors are computed: one pointing out of the front of the face and one out of the top of the head. The two audio channels can then be calculated in real time with the HRTF. A sketch of the vector computation is shown below.
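The direction vectors can be derived from mesh landmarks roughly as follows. This is a sketch, assuming `mesh` is the scaledMesh array of [x, y, z] keypoints returned by @tensorflow-models/facemesh; the landmark indices (cheeks, forehead, chin) are illustrative choices, not necessarily the ones the project uses.

```javascript
const sub = (u, v) => [u[0] - v[0], u[1] - v[1], u[2] - v[2]];
const cross = (u, v) => [
  u[1] * v[2] - u[2] * v[1],
  u[2] * v[0] - u[0] * v[2],
  u[0] * v[1] - u[1] * v[0],
];
const normalize = (v) => {
  const n = Math.hypot(v[0], v[1], v[2]);
  return [v[0] / n, v[1] / n, v[2] / n];
};

function headVectors(mesh) {
  const across = sub(mesh[454], mesh[234]);   // cheek to cheek
  const up = sub(mesh[10], mesh[152]);        // chin to forehead
  const front = normalize(cross(across, up)); // normal to the face plane
  return { front, top: normalize(up) };       // sign depends on the mesh's axes
}
```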

The front camera is used to measure the face direction. There is one limitation: you cannot turn your head too far away from the screen, or the camera can no longer find your face.

This web app can be found here: https://spatialaudio.omniai.org/spatial_audio_for_desktop.html

Final Version with Audio Visualization

Since the camera-based solution is more convenient, a web app with a 3D demonstration was designed around it. In this app, users can upload their own audio files and see the real-time level of each channel.

The first page is (a), where users can check whether the app works: if the virtual head moves as you move your own head, it works; otherwise, you can try the previous implementations. On the second page (b), users can toggle mirror mode and choose an audio file. On the third page (c), pressing the play button starts the audio. While the audio plays, you can see the volume of each channel, as (d) shows. You can mute channels by pressing them and focus on a single channel to experience the magic of spatial audio, as on page (e). A sketch of such a per-channel volume meter follows.
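One way to build the per-channel meter is with the Web Audio API, as in the sketch below; `audioEl` is assumed to be the <audio> element holding the user's uploaded file, and this is illustrative rather than the project's exact code.

```javascript
const ctx = new AudioContext();
const source = ctx.createMediaElementSource(audioEl); // audioEl is assumed
const splitter = ctx.createChannelSplitter(2);
const analysers = [ctx.createAnalyser(), ctx.createAnalyser()];

source.connect(splitter);
splitter.connect(analysers[0], 0); // left channel
splitter.connect(analysers[1], 1); // right channel
source.connect(ctx.destination);   // keep the audible path intact

// RMS volume of one channel, polled each animation frame for the meter.
function channelVolume(analyser) {
  const buf = new Float32Array(analyser.fftSize);
  analyser.getFloatTimeDomainData(buf);
  let sum = 0;
  for (const x of buf) sum += x * x;
  return Math.sqrt(sum / buf.length);
}
```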

This web app can be found here: https://spatialaudio.omniai.org/

Libraries and code

Web APIs
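Inferred from the features described above, the app relies mainly on:

  • Web Audio API (real-time audio processing and per-channel analysis)
  • DeviceOrientation API (reading the smartphone's gyroscope)
  • MediaDevices.getUserMedia (accessing the front camera)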

Libraries

  • Three.js
  • TF.js
  • @tensorflow-models/facemesh
  • Semantic-UI

Code implementation

  • The zipped code can be found in the attachments.
  • The GitHub repository is here.

Code Explanation

All functional code for this project is written in JavaScript. The .js files are explained below.

  • orientation.js: gets the orientation of the smartphone
  • plot_head_movements.js: plots 3D head movement
  • ios_access.js: detects iOS devices
  • 3d_spatial_audio.js: calculates the real-time signal from the smartphone's orientation
  • cs_audio.js: calculates the real-time signal from camera detection
  • cs_face_orientation.js: detects face orientation with TensorFlow.js and then calculates the Euler angles from the direction vectors
  • desktop_face_orientation.js: detects face orientation with TensorFlow.js
  • face_3d_spatial_audio.js: detects face orientation with TensorFlow.js
  • multi_3d_spatial_audio.js: plays different audio at different virtual positions and then calculates the stereo spatial audio

References