Rediscovering the Urban Landscape through Vision-to-Audio Soundscape Synthesis
08/31/2023
1 The Urban Environment Is a Fusion of Sounds
The urban environment is composed of a diverse set of sounds, including traffic, industry, construction, and social activities. Together, these convert the soundscape into noise pollution (de Paiva Vianna et al., 2015). Noise can increase the risk of cerebro- and cardiovascular diseases such as stroke and arterial hypertension (Hahad et al., 2019), and can cause psychological disorders such as depression and sleep disorders (Münzel et al., 2018). Evidence suggests that some sounds are annoying because they remind people of the presence of the source while preventing them from maintaining their desired mental state (Andringa & Lanser, 2011). Noise can also have positive effects. In audio engineering and physics, for example, the "color of noise" refers to the power spectrum of a noise signal. The practice of naming kinds of noise after colors started with white noise, a signal whose spectrum has equal power within any equal interval of frequencies (Lu et al., 2020). Different types of noise have distinct properties and can be used for relaxation, sleep, concentration, and other therapeutic purposes (Söderlund & Sikström, 2007).
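The distinction between noise colors can be made concrete with a small NumPy sketch (an illustration only, not part of the proposed system): shaping white noise in the frequency domain by 1/f^(e/2) in amplitude yields a 1/f^e power spectrum, where e = 0 gives white noise, e = 1 pink noise, and e = 2 brown noise.

```python
import numpy as np

def colored_noise(n_samples, exponent, seed=None):
    """Generate noise whose power spectrum follows 1/f**exponent.

    exponent = 0: white noise (flat spectrum)
    exponent = 1: pink noise
    exponent = 2: brown noise
    """
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples)
    freqs[0] = freqs[1]                      # avoid division by zero at DC
    spectrum *= freqs ** (-exponent / 2.0)   # amplitude scales as f^(-e/2)
    noise = np.fft.irfft(spectrum, n=n_samples)
    return noise / np.abs(noise).max()       # normalize to [-1, 1]

white = colored_noise(44100, exponent=0, seed=0)  # one second at 44.1 kHz
pink = colored_noise(44100, exponent=1, seed=0)
```

Because pink noise concentrates power at low frequencies, its lowest spectral bins carry far more energy than its highest ones, whereas white noise spreads energy evenly.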
2 Audio Analysis and Synthesis
Audio analysis is the process of transforming, exploring, and interpreting audio signals recorded by digital devices. Its applications include speech recognition, voice recognition, music recognition, and environmental sound recognition. Audio data represents analog sound in digital form, preserving the main properties of the original signal. It has three key characteristics: period, amplitude, and frequency. The period is the time, in seconds, the sound takes to complete one cycle of vibration. The amplitude is the sound intensity, measured in decibels (dB), which we perceive as loudness. Finally, the frequency, measured in hertz (Hz), indicates how many vibrations occur per second.
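These three characteristics can be recovered from a digital signal. The sketch below (illustrative; the 440 Hz tone and 8 kHz sample rate are arbitrary choices) generates a sine wave, estimates its frequency from the strongest FFT bin, derives the period as the reciprocal of frequency, and expresses the peak amplitude in decibels relative to full scale.

```python
import numpy as np

sample_rate = 8000   # samples per second
duration = 1.0       # seconds
freq = 440.0         # Hz (the A4 pitch)
amplitude = 0.5      # half of full scale

t = np.arange(int(sample_rate * duration)) / sample_rate
signal = amplitude * np.sin(2 * np.pi * freq * t)

# Frequency: location of the strongest bin in the magnitude spectrum.
spectrum = np.abs(np.fft.rfft(signal))
bins = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
estimated_freq = bins[np.argmax(spectrum)]   # recovers 440.0 Hz

# Period: seconds per cycle, the reciprocal of frequency.
period = 1 / estimated_freq

# Amplitude: peak sample value, expressed in dB relative to full scale.
peak = np.abs(signal).max()
level_db = 20 * np.log10(peak)               # about -6 dBFS for amplitude 0.5
```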
3 Foley Sound
In filmmaking, Foley refers to the reproduction of everyday sound effects that are added to multimedia to enhance the audio quality (Choi et al., 2022). Traditionally, Foley artists must closely observe the screen to construct the augmented sound. A deep sound synthesis network that acts as an automatic Foley artist has been proposed to generate augmented and enhanced sound effects as an overlay (Ghose & Prevost, 2021). A pleasant sonic environment, or soundscape, is characterized by the presence of meaningful sounds that concur with the character of the area (Booi et al., 2012).
4 Proposing a Vision-to-Audio Approach to Soundscape Synthesis
Evidence concerning the ecological validity of soundscape reproduction (Guastavino et al., 2005) suggests that a "feeling of being present", i.e., being part of the environment, is important for the way we interpret sounds and sonic environments: either objectively and detached, or engaged.
This paper proposes a visual-context-based framework that synthesizes soundscapes to mitigate noise pollution and improve the quality of the urban environment. Key elements in a video or image are extracted by AI and filtered by keywords such as "nature", "machine", "static", and "fluid". Foley audio related to the key elements is then selected and used to synthesize a new soundscape for the environment. To evoke the effects of particular noise colors, the composed sound is further modified in period, amplitude, and frequency.
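The pipeline above can be sketched as follows. Everything here is a hypothetical illustration: the keyword-to-element mapping, the foley library (random clips standing in for real recordings), and the single `gain` parameter standing in for the period, amplitude, and frequency modification are all assumptions, not part of the paper's implementation.

```python
import numpy as np

# Hypothetical mapping from detected visual elements to the filter keywords;
# a real system would obtain the element labels from an image-recognition model.
CATEGORY_KEYWORDS = {
    "nature": {"tree", "bird", "grass", "park"},
    "machine": {"car", "bus", "crane", "scooter"},
    "static": {"building", "bench", "statue"},
    "fluid": {"fountain", "river", "rain"},
}

# Hypothetical foley library: one second of mono audio per category
# (random noise here; real clips would be recorded foley sounds).
_rng = np.random.default_rng(0)
FOLEY_LIBRARY = {cat: _rng.standard_normal(44100) * 0.1 for cat in CATEGORY_KEYWORDS}

def categorize(detected_elements):
    """Map raw element labels to the filter-keyword categories they match."""
    return {cat for cat, words in CATEGORY_KEYWORDS.items()
            if words & set(detected_elements)}

def synthesize(detected_elements, gain=1.0):
    """Mix the foley clips of every matched category into one soundscape.

    `gain` is a placeholder for the amplitude adjustment described in the
    text; period and frequency shaping would be applied per clip as well.
    """
    cats = categorize(detected_elements)
    if not cats:
        return np.zeros(44100)
    mix = sum(FOLEY_LIBRARY[c] for c in cats) / len(cats)
    return gain * mix

scape = synthesize(["tree", "fountain", "car"])
```

A street scene containing a tree, a fountain, and a car thus blends the "nature", "fluid", and "machine" clips into a single overlay.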
5 Conclusion
Any urban environment produces noise pollution. Through AI-powered audio-and-vision soundscape synthesis, the urban landscape can be experienced in a more positive manner. Even traffic-heavy streets and crowded public spaces can be reshaped into enjoyable environments. The engineered urban soundscape keeps a trace of the urban context, providing a sense of presence, while at the same time promoting our physical well-being.
References
Akbal, E. (2020). An automated environmental sound classification method based on statistical and textural feature. Applied Acoustics, 167, 107413.
Andringa, T., & Lanser, J. (2011). Towards causality in sound annoyance. In Proceedings of the Internoise Conference 2011, Osaka, Japan, 4–7 September 2011, pp. 1–8.
Booi, H., & van den Berg, F. (2012). Quiet areas and the need for quietness in Amsterdam. International Journal of Environmental Research and Public Health, 9, 1030–1050.
Chen, M., He, X., Yang, J., & Zhang, H. (2018). 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Processing Letters, 25(10), 1440–1444.
Choi, K., Oh, S., Kang, M., & McFee, B. (2022). A proposal for Foley sound synthesis challenge.
DaSilva, B., Happi, A. W., Braeken, A., & Touhafi, A. (2019). Evaluation of classical machine learning techniques towards urban sound recognition on embedded systems. Applied Sciences, 2, 1–27.
de Paiva Vianna, K. M., Alves Cardoso, M. R., & Rodrigues, R. M. (2015). Noise pollution and annoyance: An urban soundscapes study. Noise & Health, 17(76), 125–133. doi:10.4103/1463-1741.155833.
Ghose, S., & Prevost, J. J. (2021). AutoFoley: Artificial synthesis of synchronized sound tracks for silent videos with deep learning. IEEE Transactions on Multimedia, 23, 1895–1907. doi:10.1109/TMM.2020.3005033.
Guastavino, C., Katz, B., Polack, J., Levitin, D., & Dubois, D. (2005). Ecological validity of soundscape reproduction. Acta Acustica united with Acustica, 91, 333–341.
Guzhov, A., Raue, F., Hees, J., & Dengel, A. (2022). AudioCLIP: Extending CLIP to image, text and audio. In ICASSP 2022 – IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 976–980.
Hahad, O., Kröller-Schön, S., Daiber, A., & Münzel, T. (2019). The cardiovascular effects of noise. Deutsches Ärzteblatt International, 116(14), 245–250.
Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., & Plumbley, M. D. (2023). AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503.
Lu, S. Y., Huang, Y. H., & Lin, K. Y. (2020). Spectral content (color) of noise exposure affects work efficiency. Noise & Health, 22(104), 19–27. doi:10.4103/nah.NAH_61_18.
Mu, W., Yin, B., Huang, X., et al. (2021). Environmental sound classification using temporal-frequency attention based convolutional neural network. Scientific Reports, 11, 21552. doi:10.1038/s41598-021-01045-4.
Münzel, T., Schmidt, F. P., Steven, S., Herzog, J., Daiber, A., & Sørensen, M. (2018). Environmental noise and the cardiovascular system. Journal of the American College of Cardiology, 71(6), 688–697.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, PMLR, pp. 8748–8763.
Ren, Z., et al. (2018). Deep scalogram representations for acoustic scene classification. IEEE/CAA Journal of Automatica Sinica, 5(3), 662–669.
Söderlund, G. B. W., Sikström, S., & Smart, A. (2007). Listen to the noise: Noise is beneficial for cognitive performance in ADHD. Journal of Child Psychology and Psychiatry, 48, 840–847.
Zhou, J., Liu, D., Li, X., Ma, J., Zhang, J., & Fang, J. (2012). Pink noise: Effect on complexity synchronization of brain activity and sleep consolidation. Journal of Theoretical Biology, 306, 68–72.