NGA announces Soundscapes winners

On February 2, the National Geospatial-Intelligence Agency announced the final results of Soundscapes, a $100,000 prize competition seeking algorithms that geolocate the source of video and audio recordings on Earth. Solvers submitted and tested their code on the Topcoder platform, where they could view their quantitative scores and track their rank on a leaderboard based on the training data. Each submission included three components:

  • A white paper describing their technical approach.
  • Test files indicating the city from which the video originated.
  • Confidence levels generated by their method for each of the eight cities.

“GEOINT analysis of multimedia is a developing area, and the Soundscapes Competition has provided us with critical insights into the ability to automate geolocalization of data from non-speech audio cues,” said Michelle Brennan, NGA Image & Video Pod lead and Soundscapes Competition sponsor. “We are delighted and encouraged by the level of interest shown by the community and the sophisticated solutions submitted by all of the solvers participating in the Soundscapes Competition.”

Summary of the technology used to address the Soundscapes Challenge:

  • This was a highly technical challenge focused on machine learning techniques for classifying audio recordings into one of nine possible city classes. The winning approaches shared a number of similarities, including augmenting the data by modifying the audio clips in a variety of ways, such as adding white noise to the test files.
  • All of the winners converted the audio signals into a spectrogram, an image in which the x-axis represents time and the y-axis represents frequency. These images can be analyzed with powerful deep learning methods and network architectures tailored to image-processing tasks. Most of the Soundscapes Prize Competition winners had strong backgrounds in image processing but were new to audio processing.
  • Once the audio signal was converted into an image, all of the winners trained convolutional neural networks (CNNs) on the result.
  • Most of the winners trained a wide variety of convolutional neural networks and then used an ensemble approach for final classification of the test data.
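The audio-to-spectrogram step described above can be sketched in a few lines. This is a minimal illustration using SciPy, not any winner's actual pipeline; the sample rate, window size, and noise level are assumptions chosen for the example. The white-noise addition stands in for the augmentation the summary describes.

```python
import numpy as np
from scipy.signal import spectrogram

def audio_to_spectrogram(signal, sample_rate=16000, nperseg=512):
    """Convert a 1-D audio signal into a log-scaled spectrogram
    (frequency bins x time frames), the image-like representation
    that can be fed to a convolutional neural network."""
    freqs, times, sxx = spectrogram(signal, fs=sample_rate, nperseg=nperseg)
    # Log scaling compresses the dynamic range, a common step before CNN input.
    return np.log(sxx + 1e-10)

# Example: one second of a 440 Hz tone, plus white noise as a simple
# stand-in for the data augmentation the winners used.
rate = 16000
t = np.linspace(0, 1, rate, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
augmented = clean + 0.05 * np.random.randn(rate)

spec = audio_to_spectrogram(augmented, sample_rate=rate)
print(spec.shape)  # (frequency bins, time frames)
```

The resulting 2-D array can then be treated exactly like a grayscale image by standard CNN architectures.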

Meet the winners and learn more about their solutions below.

Main (Non-Speech) Winners:

  • 1st Prize, $27,000: Vladislav Leketush. The winning approach discovered some novel modifications to existing architectures. Leketush utilized an ensemble of convolutional neural networks as well as a different kind of neural network architecture, a long short-term memory (LSTM) network, to generate the highest score of all participants.
  • 2nd Prize, $19,000: Selim Seferbekov. This approach is similar to the first-place approach, except that Seferbekov used an ensemble of CNNs and augmented the audio data using a different method than the first-place solution. Seferbekov also executed a grid search to find the optimal hyperparameters for his algorithm.
  • 3rd Prize, $13,000: Victor Durnov. This approach augmented the audio data before it was converted into a spectrogram, and again when it was in spectrogram form. Durnov used the novel approach of gathering statistics on the out-of-fold predictions and then utilized those statistics to calculate a weighted average for the ensemble model.
  • 4th Prize, $8,000: Yauhen Babakhin. This approach also converted audio data into a spectrogram and then used an ensemble of 40 deep convolutional neural networks. Although this is a high number of CNNs, Babakhin provided a method to reduce the number of CNNs in the ensemble for future use. This algorithm used just one augmentation technique on the data before it was transformed to a spectrogram, and did not use any image augmentation on the spectrogram. Overall, this algorithm stuck to tried-and-true techniques.
  • 5th Prize, $6,000: Raphael Kiminya. This approach used an architecture pretrained on audio data, called PANNs (pretrained audio neural networks). Kiminya utilized an ensemble of three models and averaged their outputs to obtain the final prediction.
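Several of the winning solutions combined their models by averaging class probabilities, with Durnov's weighting the average using out-of-fold statistics. The sketch below illustrates the general idea with NumPy; the model outputs and weights are hypothetical, not taken from any actual submission.

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Combine per-model class-probability arrays (each of shape
    n_samples x n_classes) into one prediction via a weighted average.
    Weights are normalized so the combined rows still sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack(prob_list)            # (n_models, n_samples, n_classes)
    return np.tensordot(w, stacked, axes=1)  # (n_samples, n_classes)

# Hypothetical outputs from three models on two audio clips
# across three city classes.
m1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
m2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.6, 0.2]])
m3 = np.array([[0.8, 0.1, 0.1], [0.3, 0.5, 0.2]])

# In a scheme like Durnov's, each weight could be derived from the
# model's out-of-fold performance; equal weights give a plain average.
combined = weighted_ensemble([m1, m2, m3], weights=[0.5, 0.3, 0.2])
predicted_city = combined.argmax(axis=1)
```

Averaging the probabilities before taking the argmax lets models that disagree on a clip cancel out individual errors, which is why the ensemble typically beats any single network.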

Secondary (Speech) Winners:

  • 1st Prize, $8,000: Vladislav Leketush
  • 2nd Prize, $5,750: Victor Durnov
  • 3rd Prize, $3,250: Selim Seferbekov

The call to better identify the actual recording location of audio and video files using acoustic-based machine learning methods is born of the ever-growing volume of multimedia produced globally. The agency hopes to develop this capability in support of its mission to serve the nation and the world, both through humanitarian aid and in support of national security. NGA believes this data, combined with cutting-edge machine learning models and the power of the crowd, can yield tools that GEOINT organizations simply do not yet have at their disposal. Soundscapes was designed to align with the newly released NGA Technology Strategy.

Source: NGA