Welcome to the ध्वनि blog.

ध्वनि is the Sanskrit word for sound: from Proto-Indo-Aryan *dʰwaníṣ, from Proto-Indo-Iranian *dʰwaníš, from Proto-Indo-European *dʰwen- (“to make a noise”).

I am putting down all my thoughts and learnings from my journey into modular synthesizers. I am a software engineer by profession and have been exploring sound for a while now, spending a lot of time just reading and researching the topic.

Let's start with some basics.

"How did you know how to do that?" he asks.
"You just have to figure it out."
"I wouldn't know where to start," he says.
I think to myself, That's the problem, all right, where to start.
To reach him you have to back up and back up.
And the further back you go, the further back you see you have to go,
until what looked like a small problem of communication turns into a major philosophic enquiry.
- Robert M. Pirsig, Zen and the Art of Motorcycle Maintenance

What is sound?

Sound is nothing but the result of pressure variations in the air (or any medium), caused by disturbances such as vibrations in our surroundings. For example, if you hit a tuning fork it creates a disturbance in the air pressure, and when that disturbance reaches us our brains translate that change in pressure into sound. In other words, sound is a natural acoustic phenomenon: vibrations travelling, usually through air (or some other medium), to our ears.

Ok, let's try to visualise what happens when a tuning fork is hit.

tuning fork compression and rarefaction

When the prongs of a tuning fork vibrate, they disturb the air particles around them. When the prongs move apart, they push against the neighboring air particles, causing compression. In this compressed region the particles are forced closer together, increasing the air pressure. This compression represents the "push" of the wave, but it does not involve air particles traveling with the wave, only their vibration about their mean positions. When the prongs move back to their original position, the compressed air expands, causing the air particles to move slightly away from each other, resulting in rarefaction, a region of lower pressure. The particles return to their original positions but continue to oscillate, contributing to the wave's propagation.

Air particles don't move with the sound wave because sound is a mechanical wave (meaning it needs a medium to travel), and mechanical waves transfer energy, not matter. Sound is a pressure/compression wave.

tuning fork compression and rarefaction gif

Because of this motion of the air particles due to the vibrations of the tuning fork, there are regions where the air particles are pressed together and other regions where they are spread apart. These regions are called compressions and rarefactions respectively. The alternating compressions and rarefactions form at the same points; there is no net migration of the particles. Sound results from this mechanical wave caused by the back and forth movement of the particles through a medium.
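To make the picture concrete, here is a minimal sketch in Python (NumPy assumed available; the amplitude and frequency are arbitrary illustrative values) that samples an idealised sinusoidal pressure wave. Positive pressure deviations correspond to compressions, negative ones to rarefactions:

```python
import numpy as np

# Idealised plane pressure wave: p(x, t) = A * sin(2*pi*f*t - k*x)
A = 0.1                  # pressure amplitude in pascals (illustrative value)
f = 440.0                # frequency in Hz (roughly an A4 tuning fork)
c = 343.0                # speed of sound in air, m/s
k = 2 * np.pi * f / c    # wavenumber

x = 1.0                        # observation point 1 m from the fork
t = np.linspace(0, 2 / f, 8)   # eight samples across two full cycles

p = A * np.sin(2 * np.pi * f * t - k * x)
for ti, pi in zip(t, p):
    state = "compression" if pi > 0 else "rarefaction"
    print(f"t = {ti * 1000:5.2f} ms  p = {pi:+.4f} Pa  ({state})")
```

The pressure at a fixed point simply swings above and below the ambient value; nothing in the medium travels along with the wave.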

If you could see sound waves, they would look more like what happens when you blow up a large beach ball. Imagine that you blow into a beach ball and it expands outwards in a spherical shape. When you take your mouth off the nozzle to take another breath, the ball slightly deflates. As you blow the next breath into it, the ball expands again and slightly deflates when you take the next breath, and so on until the ball is completely inflated. In a similar manner sound propagates away from the source in a sphere, moving outward, then recoiling inward, then outward again, and so on.

How do we perceive sounds?

On the level of physics, what we call sound is a vibration or, to use an old word, a verberation. This is a wave that, following the shaking movement of one or more sources, propagates and reaches our ears, where it gives rise to auditory sensations. Physically speaking, sound is the shaking movement of the medium in question. Sound propagates in concentric circles, similar to the ripples seen when throwing a rock into calm water. In air the verberation propagates at a speed of approximately 343 meters per second. It was only at the beginning of the nineteenth century that this average speed of sound, which varies slightly with pressure and temperature, could be calculated. In water the wave propagates considerably faster (about 1500 m/s). Sound also travels through hard materials, which is why it is so hard to soundproof apartments that are stacked on top of each other.

Sound in the physical sense has only two characteristics: frequency, the number of oscillations (or vibrations) per second, and pressure amplitude, which corresponds to the magnitude of the oscillation. Frequency is perceived as pitch and amplitude is perceived as sonic intensity (loudness). All of the other perceived characteristics of sound are produced by variations of frequency and amplitude over time.

Sound propagates in every direction, and it weakens in proportion to the square of the distance traveled (the inverse square law). There is a reflection when the verberation encounters a surface that does not absorb it completely and returns a portion of it, like a bouncing ball. When we hear sound via direct propagation from the source together with a reflected sound, the lag between the direct and reflected sound combines to produce reverberation. When a sound wave encounters an obstacle, a portion will bend around it; this is called diffraction. Diffraction is what makes acoustic isolation difficult to achieve.
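The inverse square falloff is easy to play with in code. A small sketch (the source power here is a made-up illustrative number) showing how intensity drops with distance from a point source:

```python
import math

# Inverse square law: the intensity of a point source falls off as 1/r^2.
# I(r) = P / (4 * pi * r^2), where P is the acoustic power of the source.
P = 0.01  # acoustic power in watts (illustrative value, roughly a raised voice)

for r in [1.0, 2.0, 4.0, 8.0]:
    intensity = P / (4 * math.pi * r ** 2)         # W/m^2
    level_db = 10 * math.log10(intensity / 1e-12)  # dB intensity level re 1e-12 W/m^2
    print(f"r = {r:4.1f} m  I = {intensity:.2e} W/m^2  ~ {level_db:.1f} dB")

# Each doubling of distance costs about 6 dB.
```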

ear structure

The ears detect, analyse, and classify biologically interesting sounds and construct a model of the auditory scene that surrounds us. The auditory system is attuned not only to listen for certain sounds but to ignore sounds that are not biologically relevant. Such sounds include ambient noises and the effects of sound reflection, refraction, and diffusion, which can distort the incoming sound. Imagine that you stretch a pillowcase tightly across the opening of a bucket and people throw balls at it from different distances. Each person can throw as many balls as they like and as often as they like. Now your job is to figure out, just by looking at how the pillowcase moves up and down, how many people there are, who they are, and whether they are walking towards you, away from you, or standing still. This is what our auditory system has to do when identifying auditory objects in the world using only the movement of the eardrum as a guide. (Of course the brain combines this with the data received through the eyes to form a holistic model of the current environment.)

The human ear is an organ that is external and internal at the same time. It is divided into the outer, middle, and inner ear. The outer ear consists of the pinna, the auditory canal (ear canal), and the eardrum. The pinna helps with protection (from wind and other things that might interfere with audition) and resonance. The funnel-shaped pinna collects sound from the environment, and its shape modifies the arriving frequency information depending on the direction of the source. The pinna directs the waves towards the tympanic membrane, and its form favours frequencies that we use in verbal communication. The ear canal of the human adult is on average 7 to 8 millimeters in diameter and 2.5 to 3 centimeters deep. Its structure also tends to attenuate sounds that impede verbal communication, mainly bass frequencies. Frequencies around 3000 Hz are transferred more efficiently by the canal, and we are more sensitive to this range of frequencies. The eardrum is bent by the force of arriving sound and transmits the motion to the middle ear. It vibrates most easily in the 1 to 3.5 kHz frequency range but transmits sounds to the inner ear over the entire audible frequency range.

The eardrum and the middle ear play a crucial role in overcoming the mismatch between the low density air in the outer ear and the much denser fluid in the inner ear, through a process called impedance matching. Impedance refers to the resistance a medium offers to the flow of energy. Air has low impedance while the fluid in the inner ear has high impedance. Without a mechanism to fix this mismatch, sound waves would mostly be reflected back when they hit the fluid in the inner ear. This would reduce the amount of sound energy entering the inner ear, making hearing inefficient. Normally, when sound travels from a low impedance medium like air to a high impedance medium like water, almost all of the acoustic energy is reflected. We still have only a limited understanding of how the eardrum accomplishes this task over such a wide frequency range.
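That sensitivity around 3000 Hz follows from simple physics: a tube closed at one end (here, closed by the eardrum) resonates at roughly a quarter of a wavelength. A quick sketch using the canal depths quoted above:

```python
# The ear canal approximates a tube closed at one end (by the eardrum).
# Such a tube resonates when its length is a quarter of the wavelength:
#     f = c / (4 * L)
c = 343.0  # speed of sound in air, m/s

for depth_cm in [2.5, 3.0]:   # canal depths quoted above
    L = depth_cm / 100.0      # convert to meters
    f = c / (4 * L)
    print(f"canal depth {depth_cm} cm -> resonance ~ {f:.0f} Hz")
```

Both depths land near 3 kHz, right where the canal transfers sound most efficiently.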

inner ear structure

The middle ear is primarily made up of the tympanic membrane and a series of tiny bones called ossicles, which go by the names of "hammer" (malleus), "anvil" (incus), and "stirrup" (stapes). These bones transfer the pressure vibrations as mechanical vibrations, which are passed along to the entrance of the cochlea, called the oval window. The tympanic membrane (eardrum), approximately one centimeter in diameter, is an elastic membrane in contact with these ossicles. This series of little bones transmits the vibrations from the tympanic membrane to the inner ear while amplifying the frequencies between 1000 and 4000 Hz. It is here that the bones protect against sounds that are too loud, with a mechanism called the acoustic reflex (also called the stapedius reflex or middle ear reflex). Not only does this mechanism protect us from loud sounds, it also dulls the loudness of our own voices when we speak. It involves tiny muscles attached to the ossicles that respond to loud sounds by reducing the movement of the bones, thus limiting the amount of energy transmitted to the inner ear. The stapedius muscle and the tensor tympani muscle contract reflexively within 10-20 ms in response to sound pressures exceeding 70-100 dB SPL (Sound Pressure Level). When these muscles contract they dampen the movement of the ossicles, which stiffens the ossicular chain, reducing the transmission of sound vibrations to the inner ear. By stiffening the ossicles, the reflex reduces the intensity of the sound that reaches the cochlea by approximately 20 decibels.
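Decibels are logarithmic, so a 20 dB reduction is larger than it sounds. A tiny sketch converting that attenuation into pressure and intensity ratios:

```python
# A change of -20 dB corresponds to:
#   pressure ratio  = 10 ** (dB / 20)
#   intensity ratio = 10 ** (dB / 10)
attenuation_db = -20.0  # approximate effect of the acoustic reflex

pressure_ratio = 10 ** (attenuation_db / 20)   # 0.1  -> one tenth of the sound pressure
intensity_ratio = 10 ** (attenuation_db / 10)  # 0.01 -> one hundredth of the intensity

print(f"{attenuation_db:.0f} dB -> pressure x {pressure_ratio}, intensity x {intensity_ratio}")
```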

cochlea cross section

The inner ear consists of organs that help with balance and audition. Audition is handled by the cochlea, Greek for "snail", named for its coiled shape. The stapes connects to the oval window at one end of the cochlea, a fluid-filled tube that connects to the auditory nerve. The other end is called the round window. The side of the oval window is called the scala vestibuli and the side of the round window is called the scala tympani. The two scalae are filled with perilymph, a fluid similar to spinal fluid. The vibration transmitted from the stapes to the oval window moves the fluid back and forth. The round window has a flexible membrane and acts as a pressure release point, bulging outward when the fluid moves. The two scalae enclose the scala media, which is filled with endolymph, a fluid similar to intracellular fluid. Within the scala media is the organ of Corti, which rests on the basilar membrane and contains specialised sensory cells called hair cells. It includes 3 rows of outer hair cells and 1 row of inner hair cells.

basilar membrane frequency encoding

The basilar membrane runs down the center of the cochlea. About 30,000 hair cells are attached along its length (the same hair cells we talked about in the organ of Corti). The membrane is narrower and stiffer at the base, close to the oval and round windows; this end vibrates more at high frequencies than the wider, more flexible end at the apex. Low frequencies vibrate the perilymph more intensely at the apex of the membrane. The position of maximum vibration along the membrane thus encodes the frequency information required for audition.
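This place-to-frequency mapping is commonly modelled with the Greenwood function. The sketch below uses the standard textbook constants for the human cochlea (they are approximations from the literature, not values given in this article):

```python
def greenwood_frequency(x: float) -> float:
    """Greenwood function for the human cochlea.

    x is the fractional distance from the apex (0.0) to the base (1.0).
    A = 165.4, a = 2.1, k = 0.88 are the usual human estimates.
    """
    return 165.4 * (10 ** (2.1 * x) - 0.88)

# Best frequencies at a few positions along the basilar membrane:
for x in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"x = {x:.2f} (apex -> base)  f ~ {greenwood_frequency(x):8.0f} Hz")
```

The output spans roughly 20 Hz at the apex to about 20 kHz at the base, matching the range of human hearing.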

cochlea spectral analysis graph

A short sample of the neural activity pattern, plotted as time (in ms) against position along the length of the cochlea (in cm), produced by the cochlea in response to a musical note composed of all the harmonics of 62.5 Hz. The graph shows the temporal microstructure of the cochlea's analysis of the sound.

As the membrane moves, it causes the hair cells in the organ of Corti to bend. The tops of these hair cells carry tiny hair-like structures called stereocilia. When these bend, they open ion channels, generating an electrical signal. There are mechano-sensitive ion channels at the tips of the stereocilia, and these channels are sensitive to mechanical deformation: on bending, the tips stretch, which pulls the channels open. This allows potassium ions (K+), which are abundant in the endolymph, to flow into the hair cell. This influx of potassium ions causes the inside of the cell to become depolarized (less negatively charged), which triggers the opening of voltage-gated calcium channels in the hair cell membrane. Calcium ions (Ca2+) from the perilymph enter the hair cell, and this leads to the release of neurotransmitters from the hair cell onto the spiral ganglion neurons, which relay the signals from the inner ear to the brain stem. After the initial bending of the stereocilia, if the stereocilia bend back in the opposite direction, the ion channels close and the hair cell becomes hyperpolarized (more negatively charged). This shuts down the inflow of ions, halting the release of neurotransmitters and stopping the generation of action potentials.

Ok, so the ears have processed the pressure waves, turned them into electrical signals, and transmitted them to the brain, but how does the brain turn these signals into sound? Let's try to make sense of that by taking the example of music (organised sound).

The human brain is divided into four lobes: the frontal, temporal, parietal, and occipital, plus the cerebellum. The temporal lobe is associated with hearing and memory. The cerebellum, the evolutionarily oldest part of the human brain, going back to the reptilian brain, is involved in emotions and the planning of movements. Different aspects of music are processed by different parts of the brain: the brain uses functional segregation for processing it, with a system of feature detectors that analyse specific aspects like pitch, timbre, tempo, etc. Brains are complex computational systems; networks of interconnected neurons perform computations which lead to thoughts, perceptions, and ultimately consciousness. The average brain consists of one hundred billion (100,000,000,000) neurons. This is a lot of neurons, but the real complexity arises through their connections. The number of ways n neurons can be connected to each other is 2^(n(n-1)/2):
For 2 neurons there are 2 possibilities for how they can be connected
For 3 neurons there are 8 possibilities
For 4 neurons there are 64 possibilities
For 5 neurons there are 1024 possibilities
For 6 neurons there are 32,768 possibilities
The number of combinations possible for an average brain is HUGE! Much of the computational power comes from this enormous possibility for interconnection and from the fact that brains are massively parallel processors, unlike conventional computers, which largely process instructions serially.
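A quick sketch verifying the numbers above (Python integers have arbitrary precision, so the values are exact):

```python
def connection_patterns(n: int) -> int:
    """Number of possible connection patterns among n neurons.

    Each of the n * (n - 1) / 2 neuron pairs is either connected or not,
    giving 2 ** (n * (n - 1) // 2) possibilities.
    """
    return 2 ** (n * (n - 1) // 2)

for n in range(2, 7):
    print(f"{n} neurons -> {connection_patterns(n):,} possible patterns")

# For n = 100 billion, the exponent n*(n-1)/2 alone is about 5e21,
# so the count itself is astronomically large.
```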

This is how the auditory system processes sound: it doesn't have to wait to find out what the pitch of a sound is before determining where the sound is coming from; separate, dedicated neural circuits handle these tasks. If one of the circuits finishes processing, it can send its information to other parts of the brain, which can start using that information right away. If late-arriving information affects the interpretation of the earlier information, our brains simply "change their mind" and update their model of the incoming sound. This is called predictive coding. According to the theory, the brain is constantly generating and updating a mental model of the environment to predict incoming sensory information. The theory is part of a broader framework in which Bayesian inference is the fundamental mechanism for perception, cognition, and action. The brain doesn't just passively receive information from the world; it actively predicts it and uses the incoming data to correct or refine its predictions.
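As a loose illustration, and emphatically a toy model rather than how neurons implement it, predictive coding can be sketched as a loop that keeps a running prediction and corrects it by a fraction of the prediction error:

```python
# Toy predictive-coding loop: keep a running prediction and nudge it toward
# each new observation by a fraction of the prediction error ("surprise").
prediction = 0.0
learning_rate = 0.5  # how strongly prediction errors update the internal model

observations = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0]  # the input changes mid-stream

for obs in observations:
    error = obs - prediction             # prediction error
    prediction += learning_rate * error  # correct the model with the error
    print(f"observed {obs:.1f}  error {error:+.2f}  new prediction {prediction:.2f}")
```

Notice how the error spikes when the input suddenly changes and then shrinks as the model "changes its mind" and settles on the new value.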

How does the brain figure out, from the disorganised mixture of molecules beating against the eardrum, what's out there in the world?
The brain uses a process called feature extraction, followed by another process called feature integration. The brain extracts low level features using specialised neural networks that decompose the incoming signal into information about pitch, timbre, spatial location, loudness, etc. Frequency (pitch) is processed by specific regions of the auditory cortex that correspond to different frequencies, an arrangement known as tonotopic organization. Neurons at various levels in the auditory pathway are topographically arranged by their response to different frequencies. This organization is referred to as tonotopy or cochleotopy. The primary auditory cortex, located in the temporal lobe, is tonotopically organised just like the cochlea for detecting the frequencies in a sound. Neurons in different regions of the auditory cortex are sensitive to specific frequencies: cells that respond to low frequencies are located in one region of the cortex, and cells that respond to high frequencies are located in another. As we noted previously, the same structure is followed by the cochlea to detect the frequencies in a sound. Volume (loudness) is perceived based on the intensity of neural firing in response to the electrical signals. These operations are carried out in parallel and can operate somewhat independently of one another. This sort of processing, where only the information contained in the stimulus is considered by the neural circuits, is called bottom up (sensory) processing. In the brain these attributes of music/sound are separable. The term low level refers to the perception of the building-block attributes of a sensory stimulus.
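A rough software analogy for this frequency decomposition (just an illustration with NumPy's FFT, not a model of the cortex): split a signal into frequency components and see which "channels" respond most strongly.

```python
import numpy as np

# A crude software analogy for tonotopic analysis: decompose a signal into
# frequency components with an FFT and report which "channels" respond.
sample_rate = 8000                     # samples per second
t = np.arange(0, 1.0, 1 / sample_rate)

# A signal containing two tones at 440 Hz and 1000 Hz (arbitrary choices):
signal = 0.8 * np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1000 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), 1 / sample_rate)

# The two strongest components, analogous to which cortical regions fire most:
strongest = np.argsort(spectrum)[-2:]
for i in sorted(strongest):
    print(f"strong response at {freqs[i]:.0f} Hz (magnitude {spectrum[i]:.0f})")
```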
At the same time as feature extraction is taking place in the cochlea, auditory cortex, and brain stem, the higher level centers of the brain are receiving a constant flow of information about what has been processed so far, and this information is continuously updated and rewritten if needed. Centers for processing higher thought, mostly in the frontal cortex, keep receiving these updates and work quickly to predict what will come next in the music/sound. The enjoyment we experience from music stems from the brain's ability to predict upcoming musical patterns, such as rhythm, melody, and harmony. When these predictions are confirmed, it creates a sense of satisfaction, making the music feel rewarding. This interplay between expectation and fulfillment is a key reason why music is so pleasurable to our brains.
The frontal lobe calculations are called top down processing, and they can influence the lower level circuits while those are doing bottom up processing. The top down (cognitive) and bottom up (sensory) processes communicate with each other in an ongoing fashion. Features extracted at the lower levels are integrated at higher levels to form a cohesive perceptual whole. Top down and bottom up processing together form one of the key mechanisms in predictive coding, helping us build an accurate model of our surroundings.

The transformation of electrical signals into the conscious experience of sound in the brain is still one of the great mysteries of neuroscience. While much is known about the neural pathways involved in hearing, the exact mechanism by which electrical signals are turned into the subjective perception of sound, what we call qualia, remains elusive. This phenomenon is part of the broader "hard problem of consciousness", a concept introduced by philosopher David Chalmers. But hopefully this article gives you a holistic view of what happens from the moment a sound is produced to the moment we perceive it. I'll keep adding more to this article as my research into the topic continues.