Spatial Audio
- 2021-12-04
- 胡炳城, Bingcheng
Abstract
Nowadays, with Apple's release of the AirPods 3, spatial audio has become a hot topic and everyone wants to try it. But all the devices that support spatial audio are expensive, so an implementation of spatial audio on cheap devices is useful. In this project, the gyroscopes inside smartphones and the front cameras of phones and computers are used as tools to support spatial audio. The webpage demonstrating this idea can be found here.
Introduction
Spatial audio
Why do we need spatial audio? In early theaters, the sound system had only one channel. Then one day Blumlein visited the cinema with his wife and felt that a single channel was strange, because it sounded as if all the voices came from one place.
Then came the stereo signal. Stereo panning is the simplest implementation of spatial audio, which works much like the real world, and users were satisfied with it. But as technology developed, people realized that just 2 channels could not fulfill users' pursuit of real spatial audio: audio coming from all directions.
Then surround sound came along, which uses 7 speakers: front, front left, front right, left, right, back left, and back right. This system is quite impressive and is still used in many cinemas today.
But a system with many speakers can be expensive, and not everyone has the space to arrange one. Hence came calculated spatial audio.
Head-Related Transfer Function Implementation
Head-Related Transfer Function (HRTF) is used in this project, which is one of the best implementations of calculated spatial audio.
The human head and ears act as lowpass filters and can produce significant attenuation. When a person turns his or her head left and right, the received sound changes, and this is why people can tell where an audio source is.
With an effective head radius \(a\), azimuth \(\theta\), speed of sound \(c\), and a coefficient \(\alpha(\theta)\) that controls the position of the filter's zero, we can calculate the head-shadow HRTF as below. \[ \begin{array}{l}\omega_{0}=c / a \\ \alpha(\theta)=1.05+0.95 \cos \left(180^{\circ} \cdot \theta / 150^{\circ}\right) \\ H R T F=\frac{\left(\omega_{0}+\alpha f_{s}\right)+\left(\omega_{0}-\alpha f_{s}\right) z^{-1}}{\left(\omega_{0}+f_{s}\right)+\left(\omega_{0}-f_{s}\right) z^{-1}}\end{array} \] The interaural time delay is modeled with a first-order allpass filter whose delay is \[ \tau_{h}(\theta)=\left\{\begin{array}{cc}-a(\cos \theta) / c & \text { if } 0 \leq|\theta|<\pi / 2 \\ a(|\theta|-\pi / 2) / c & \text { if } \pi / 2 \leq|\theta|<\pi\end{array}\right. \] Because a person's two ears are at different positions, the audio signals for the two ears can be calculated separately with these formulas.
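As a minimal sketch, the formulas above can be written in JavaScript. The constants (speed of sound, head radius, sample rate) and the helper names are illustrative assumptions, not the project's exact code:

```javascript
// Sketch of the head-shadow filter coefficients and the interaural time
// delay from the formulas above. Constants are illustrative assumptions.
const c = 343;      // speed of sound, m/s
const a = 0.0875;   // effective head radius, m
const fs = 44100;   // sample rate, Hz

// First-order IIR coefficients: H(z) = (b0 + b1 z^-1) / (a0 + a1 z^-1)
function headShadowCoeffs(thetaDeg) {
  const w0 = c / a;
  // alpha(theta) = 1.05 + 0.95 * cos(180° * theta / 150°), theta in degrees
  const alpha = 1.05 + 0.95 * Math.cos(Math.PI * thetaDeg / 150);
  return { b0: w0 + alpha * fs, b1: w0 - alpha * fs, a0: w0 + fs, a1: w0 - fs };
}

// Interaural time delay tau_h(theta) in seconds, theta in radians
function itd(theta) {
  const t = Math.abs(theta);
  return t < Math.PI / 2 ? -a * Math.cos(theta) / c : a * (t - Math.PI / 2) / c;
}
```

Running `headShadowCoeffs` once per ear with that ear's azimuth, and delaying one channel by the ITD, yields the two-channel output described above.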
Apple’s implementation
Just as the following image shows, there is an Inertial Measurement Unit (IMU) in each AirPods Pro earbud, so the user's head orientation can be calculated in real time. With this head-orientation signal, Apple Music can compute spatial audio in real time: when users turn their heads left and right, they can tell that the sounds come from different directions.
Actually, spatial audio appeared earlier than Apple's AirPods Pro: Sony's PS4 and PS5 both support spatial audio with their specially designed headphones. However, all of these devices are expensive, and not everyone has them (I have neither). So I implemented spatial audio with the gyroscope inside a smartphone or the camera of a desktop computer.
Implementations
IMU-based Method
Below is the diagram for IMU-based spatial audio.
In this web app, the user's head direction is measured with the gyroscope inside the smartphone; users should keep the phone pointing the same way as their head. The audio is then generated in real time to mimic spatial audio.
We recommend tying your phone to the top of your head, with the screen facing up and the top of the phone pointing toward the tip of your nose. Although strapping a phone to your head is not convenient, this method works for almost everyone!
This web app can be found here: https://spatialaudio.omniai.org/spatial_audio_for_phones.html
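A sketch of how the gyroscope reading might drive the audio: the browser's `deviceorientation` event reports the phone's heading in degrees, which can be converted to a forward vector for the Web Audio listener. The angle-to-vector mapping below is a simplified assumption, not the app's exact code:

```javascript
// Hypothetical helper: turn the deviceorientation "alpha" heading (degrees)
// into a forward unit vector in Web Audio's coordinate system, where the
// default listener faces -z. The mapping is an illustrative assumption.
function headingToForward(alphaDeg) {
  const rad = alphaDeg * Math.PI / 180;
  return { x: -Math.sin(rad), y: 0, z: -Math.cos(rad) };
}

// Browser-only wiring (requires a page served over HTTPS):
// window.addEventListener('deviceorientation', (e) => {
//   const f = headingToForward(e.alpha || 0);
//   const listener = audioCtx.listener;
//   listener.forwardX.value = f.x;
//   listener.forwardY.value = f.y;
//   listener.forwardZ.value = f.z;
// });
```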
Camera-based Method
However, as wearing a mobile phone on top of your head is rather silly, I decided to make a new version that uses the camera, as shown below.
In this solution, a TensorFlow model generates a face mesh, from which the direction of the head is calculated in real time as two vectors: one pointing to the front of the face and one to the top of the head. The two audio channels can then be calculated in real time with the HRTF.
You can use your front camera to measure the face direction, though there is one defect: you cannot turn your head all the way around, or the camera will lose your face.
This web app can be found here: https://spatialaudio.omniai.org/spatial_audio_for_desktop.html
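As an illustration of the two-vector idea, the front-of-face vector can be estimated as the unit normal of a triangle of face-mesh landmarks. The landmark choice (eye corners and chin) and the orientation convention below are assumptions, not the exact ones the app uses:

```javascript
// Sketch: estimate the "front" vector of the head as the unit normal of a
// triangle formed by three face-mesh landmarks, e.g. both eye corners and
// the chin. Landmark choice and sign convention are assumptions.
function sub(p, q) { return [p[0] - q[0], p[1] - q[1], p[2] - q[2]]; }
function cross(u, v) {
  return [u[1] * v[2] - u[2] * v[1],
          u[2] * v[0] - u[0] * v[2],
          u[0] * v[1] - u[1] * v[0]];
}
function normalize(v) {
  const n = Math.hypot(v[0], v[1], v[2]);
  return [v[0] / n, v[1] / n, v[2] / n];
}

// The face normal is the normalized cross product of two in-plane edges.
function faceNormal(leftEye, rightEye, chin) {
  return normalize(cross(sub(rightEye, leftEye), sub(chin, leftEye)));
}
```

The top-of-head vector can be built the same way from a second landmark triangle, and the two vectors together give the Euler angles used by the HRTF stage.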
Final Version with Audio Visualization
The camera-based solution is more convenient, so a web app with a 3D demonstration was designed. In this app, users can upload their own audio files and see the real-time audio in the different channels.
The first page is (a), where users can check whether the app works: if the 3D head moves as you move your head, it works; otherwise, try the previous implementations. On the second page (b), users can choose whether to use mirror mode and select the audio file. On the third page (c), press the play button to hear the audio. While the audio plays, you can even see the volume of each channel, as (d) shows. You can mute channels by pressing them and focus on a single channel to experience the magic of spatial audio on page (e).
This web app can be found here: https://spatialaudio.omniai.org/
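The per-channel volume shown on page (d) can be computed as an RMS level. The helper below is testable on its own; the Web Audio wiring (a ChannelSplitterNode plus one AnalyserNode per channel) is browser-only and sketched in comments. The names are assumptions about how such a meter could be built, not the app's exact code:

```javascript
// RMS level of one block of samples; roughly in [0, 1] for full-scale audio.
function rms(samples) {
  let sum = 0;
  for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
  return Math.sqrt(sum / samples.length);
}

// Browser-only wiring (illustrative): split the signal into channels and
// meter one channel every animation frame.
// const splitter = audioCtx.createChannelSplitter(2);
// const analyser = audioCtx.createAnalyser();
// splitter.connect(analyser, 0);          // meter the left channel
// const buf = new Float32Array(analyser.fftSize);
// analyser.getFloatTimeDomainData(buf);
// const level = rms(buf);                 // drive the on-screen volume bar
```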
Libraries and code
Web APIs
Libraries
- Three.js
- TF.js
- @tensorflow-models/facemesh
- Semantic-UI
Code implementation
- The zipped code can be found in attachments.
- The GitHub repository is here.
Code Explanation
For this project, all functional code is written in JavaScript. The .js files are explained below.
- orientation.js — get the orientation of smartphones
- plot_head_movements.js — plot 3D head movement
- ios_access.js — detect iOS devices
- 3d_spatial_audio.js — calculate the real-time signal from smartphone orientation
- cs_audio.js — calculate the real-time signal from camera detection
- cs_face_orientation.js — detect face orientation with tensorflow.js and then calculate the Euler angles from the direction vectors
- desktop_face_orientation.js — detect face orientation with tensorflow.js
- face_3d_spatial_audio.js — detect face orientation with tensorflow.js
- multi_3d_spatial_audio.js — play different audio at different virtual positions and then calculate the stereo spatial audio
References
- Cartesian coordinate system: Wikipedia
- Euler angles: Wikipedia
- Phone Orientation: W3C
- Multichannel 7.1 and 5.1 Wav Test Files: jensign.com
- 7.1 surround sound: Wikipedia
- Spatial Audio Slides: NYU BrightSpace
- Web audio spatialization basics: MDN Web Docs