It’s rare to get exactly one recording of an a cappella concert. Usually someone’s parents have a fancy but outdated camcorder, someone in the front row has a cell phone video with a great angle but terrible quality, and there’s a beautiful audio-only recording, maybe straight from the mixing board. All the recordings are independent, starting and stopping at different times. Some are only one song long, or are broken into many short pieces.
If you want to combine all these inputs into a video that anyone could watch, you’ll first have to line them up correctly in a video editor. This is a painful process of dragging clips around on the timeline with the mouse, trying to figure out if they’re in sync or not. The usual trick to making this achievable is to look at the audio waveform visualization, but even so, the process can be tedious and irritating.
This year, when I got three recordings from the VoiceLab spring concert, I resolved to solve the problem once and for all. I set about writing an automatic clip alignment algorithm as a patch to PiTiVi, a beautiful (if not mature) free software video editor written in Python.
Today, after about two months of nights and weekends, the result is ready for testing in PiTiVi mainline. Jean-François Fortin Tam has a great writeup explaining how it works from a user’s perspective.
I hadn’t looked into it until after the fact, but of course this is not the first auto-alignment function in a video editor. Final Cut Pro appears to have a similar function built in, and there are also plug-ins such as “Plural Eyes” for many editors. However, to the best of my knowledge, this is the first free implementation, and the first available on Linux. Comparing features in PiTiVi vs. the proprietary giants, I think of this as “one down, 20,000 to go”.
I guess this is as good a place as any to talk about the algorithm, which is almost The Simplest Thing that could Possibly Work. Alignment works by analyzing the audio tracks, relying on every video camera to have a microphone of its own. The most direct approach might be to compute the cross-correlation of the raw audio tracks and look for the peak … but that would require holding multi-gigabyte audio buffers in memory and performing impossibly large FFTs, which makes it impractical on today’s hardware.
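To get a feel for the scale (my own back-of-the-envelope numbers, not measurements from the patch): a single one-hour mono track at 48 kHz is about 170 million samples, and a full FFT-based cross-correlation of two such tracks needs transforms spanning both of them.

```python
rate = 48_000                      # samples per second
samples = rate * 3600              # one hour of mono audio ≈ 1.7e8 samples

# A full linear cross-correlation of two length-N signals needs
# FFTs of length >= 2N - 1; round up to the next power of two.
fft_len = 1 << (2 * samples - 1).bit_length()

bytes_per_point = 16               # complex128: 8 bytes real + 8 bytes imag
gib_per_array = fft_len * bytes_per_point / 2**30

print(fft_len, gib_per_array)      # 536870912 points, 8.0 GiB per complex array
```

And that is per array, for a single pair of tracks; with several cameras the direct approach gets out of hand quickly.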
The algorithm I settled on resembles the method a human uses when looking at the waveform view. First, it breaks each input audio stream into 40 ms blocks and computes the mean absolute value of each block. The resulting 25 Hz signal is the “volume envelope”. The code subtracts the mean volume from each track’s envelope, then performs a cross-correlation between tracks and looks for the peak, which identifies the relative shift. To avoid performing N^2 cross-correlations, one clip is selected as the fixed reference, and all others are compared to it. The peak position is quantized to the block duration (creating an error of ±20 ms), so to improve accuracy a parabolic fit is used to interpolate the true maximum. I don’t know the exact residual error, but I expect it’s typically less than 5 ms, which should be plenty good enough, seeing as sound travels about 1 foot per ms.
My original intent was to compensate for clock skew as well, because these devices all have independent sample clocks that run at slightly different rates due to manufacturing variation. There’s even code in the commit for a far more complex algorithm that can measure this clock skew. At the moment, this code is unused, for two reasons: none of our test clips actually showed appreciable skew, and PiTiVi doesn’t yet support changing the speed of clips, especially audio.
If you want to help, just stop by the PiTiVi mailing list or IRC channel. We can use more test clips, a real testing framework, a cancel button, UI improvements, conversion to C for speed, and all sorts of general bug squashing. For this feature, and throughout PiTiVi, there’s always more to be done. I’ve found the developer community to be extremely welcoming of new contributions … come and join us.