What Makes Music Sound Good?
Was digging through folders on my Mac this week, and stumbled across a long paper I wrote at Yale, back in 2001, on the cognitive science of music – and, particularly, on what makes music sound good.
It appears I’d forgotten pretty much everything I once knew when I was writing the paper, so it was a fun and fascinating read for the current me. In case you have too much free time on your hands, and similar interests, the whole thing is below.
—
“It is night and the vacant cavern is dim, chilly, still. A few animals have arrived before the others, bustling about the immense expanse beneath the cavern roof sixty feet above. From time to time a cry echoes through the chamber and the flurry of activity increases. And then, all at once, a herd of two thousand shuffles in.
It is a highly territorial species and each animal seeks out its rightful station in the cavern. Those of highest status roost farthest in; others withdraw to murky alcoves. Outside, they had cooed and preened, dominated and submitted, but all that is finished now. It is time to nest.
The cavern visitors are a species of tool users, and when a group of a hundred more enter – individuals with distinctive black and white coloration – they carry oddly shaped wooden boxes and metal tubes to the front of the chamber, where they sit together. Abruptly, the dominant male struts in, climbs to a position above all the others, and performs a triumph display. His arrival is greeted by much hooting and clatter… The dominant male suddenly commences an elaborate display, swinging his forelimbs to and fro. It has begun.
Sound. Glorious sound. Sound of a kind little encountered outside the cavern, each tone a choir in itself, pure and enduring. Patterns ascend to gyrate in midair, then fold into themselves and melt away as even grander designs soar… first it conveys circumspect pleasure. Then delight. Then amazement. Then elation. For something is emerging from the patterns, something between the tones that is unheard yet is substantial as any sound. Voices hurl together; bass tones rise above a furious sweep of treble; the sound lowers its horns and charges. Deep within, there’s a tightening, a verging, a sensation of release from gravity’s pull. Ecstasy.” (Jourdain, 1996)
In this fanciful New York Times review of a New York Philharmonic concert, writer Robert Jourdain captures the essential paradox of music: clearly, the behaviors that generate musical sound are exceedingly complex, yet music seems to affect listeners at a simple, visceral, even primitive level. How, then, does music work? What is it that makes music sound good, and what is it about musical sound that elicits such a strong emotional response?
The question itself is hardly new. The ancient Greeks, for example, examined the philosophy of music in depth. In fact, much of what we know about the structure of Western music today still stems from Pythagoras’ work in the 6th century BC. Yet despite the many thinkers in various disciplines exploring different aspects of musical sound, relatively little concrete progress was made in determining what made sound into music and music into pleasure until midway through the twentieth century. By the 1970s, through advances in other disciplines, researchers had begun to discover things about music almost as side effects of other research. Using tools from linguistics, computer science, neuroscience, cognitive psychology and many other fields, researchers began for the first time to explain key aspects of musical experience.
Despite this growing body of research, very few sources have attempted to tie the disparate findings together. That, essentially, is the aim of this paper: to lay out some of the most important musical research in the various sub-disciplines of Cognitive Science, and then to use those findings to sketch an integrated picture of musical cognition (or, at least, as complete a picture as possible, while discussing which blanks remain and how they might be filled by future research).
Acoustics
An obvious place to begin a consideration of music is with an examination of the musical sounds themselves. The study of such sounds is the domain of musical acoustics, a field born in ancient Greece in the 6th century BC when Pythagoras, listening to the pings of a blacksmith’s hammers, noticed that hammers with weights related by ratios of small integers produced consonant musical intervals. A 1:2 ratio yielded an octave, 2:3 a perfect fifth, 3:4 a perfect fourth, 4:5 a major third, and so on. After a bit of experimentation, Pythagoras discovered a similar effect with the lengths of plucked strings, and extrapolated his idea of consonant ratios from weight to number in general. Pythagoras was also the first to relate these ratios to the scale, building the Pythagorean scale (the commonly used Western diatonic major) from successive fifths and their octaves. While Pythagoras helped to delineate what sounded good, he never truly investigated why it did. A sort of mysticism about numbers pervaded classical science and philosophy at the time, and the association of integers with musical intervals was regarded as sufficient explanation in itself, rather than as cause for further exploration. (Interestingly, at about the same time in ancient China, a similar mathematically focused intellectual renaissance was underway. Less than a hundred years after Pythagoras, a Chinese thinker named Mei-shuei elucidated the same relationships, which he used to derive the pentatonic scale.)
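As a concrete illustration of that construction (a small Python sketch added here, not part of any cited source), the Pythagorean diatonic scale can be generated by stacking perfect fifths and folding each one back into a single octave:

```python
from fractions import Fraction

def pythagorean_major_scale():
    """Build the Pythagorean diatonic major scale from a stack of perfect fifths.

    Stacking fifths (3:2), one below the tonic and five above, and folding
    each ratio back into a single octave gives the seven scale degrees.
    """
    fifth = Fraction(3, 2)
    degrees = []
    for k in range(-1, 6):
        r = fifth ** k
        while r >= 2:          # fold down into the octave
            r /= 2
        while r < 1:           # fold up into the octave
            r *= 2
        degrees.append(r)
    return sorted(degrees)

print([str(r) for r in pythagorean_major_scale()])
# ['1', '9/8', '81/64', '4/3', '3/2', '27/16', '243/128']  (do re mi fa sol la ti)
```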
The first key breakthrough in understanding the why of music was the relation of musical pitch to rate of vibration, which Mersenne and Galileo discovered apparently independently in the 17th century. From this, Galileo proposed the first theory explaining why certain intervals sound consonant (one of the simplest yet most central issues in understanding why music sounds good). Galileo proposed that “agreeable consonances are pairs of tones which strike the ear with a certain regularity – the first and most pleasing consonance is, therefore, the octave, for every pulse given to the tympanum by the lower string, the sharp string delivers two.” Galileo’s theory of consonance is essentially a rhythmic one – the combination of consonant tones forms a regular, repeating pattern that beats on the drum of the ear. Research into the time resolution of the ear, however, seems to rule out such an explanation. The ear can resolve randomly timed pulses only down to separations of roughly one millisecond (Patterson & Green, 1970; Pierce, 1990), while the sound of the upper keys of a piano cycles considerably faster (C8 has a fundamental of 4186 Hz). We have no difficulty hearing consonance and dissonance in dyads and triads in the upper range of the piano, something we would not be able to do if Galileo’s theory were the correct explanation of consonance.
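A quick back-of-the-envelope check (a sketch added for illustration, not from the original text) makes the mismatch concrete:

```python
c8_fundamental_hz = 4186.0                  # top note of the piano
period_ms = 1000.0 / c8_fundamental_hz      # duration of one full cycle of C8
ear_resolution_ms = 1.0                     # approximate limit cited above (Patterson & Green, 1970)

print(f"C8 period: {period_ms:.2f} ms")     # ~0.24 ms
print(period_ms < ear_resolution_ms)        # True: far too fast for rhythmic pulse-counting
```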
Mersenne, however, also made a second discovery that established a footing for a more plausible explanation of consonance. He was the first to hear pitches higher than the fundamental frequency in the sounds of plucked strings. In the early 19th century, Jean-Baptiste Joseph Fourier showed that musical sound is in fact the summation of stacks of the pitches Mersenne heard, pure sine waves called partials. In 1877, Helmholtz proposed a theory of dissonance based upon these partials. According to his theory, dissonance arises when partials are close together in frequency. The advent of the synthesizer allowed empirical verification of Helmholtz’s theory: using synthesized sine waves, Plomp demonstrated that perceived dissonance increases as two sine waves draw closer together in frequency (Plomp, 1966). Helmholtz’s theory also makes sense in the context of the Pythagorean scale. For the octave, the least dissonant interval, the partials of the upper tone all align with partials of the lower tone, and therefore add no dissonance. For the next most consonant interval, the fifth, if the number of partials heard is modest (there is theoretically an infinite series of ever higher partials, but usually only six to nine are actually heard), a number of partials align, and the partials that do not coincide are spaced fairly widely apart. To a lesser degree, this is also true of other consonant intervals, like the fourth and the major and minor thirds.
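The flavor of the argument is easy to check numerically. The sketch below (my own illustration, not any cited study’s method) simply counts how many partials of the upper tone land on partials of the lower tone for a few intervals:

```python
def partials(f0, n=8):
    """First n harmonic partials of a tone with fundamental f0 (in Hz)."""
    return [f0 * k for k in range(1, n + 1)]

def aligned_partials(f_low, ratio, n=8, tol=1.0):
    """Count upper-tone partials that land (within tol Hz) on a lower-tone partial.

    Aligned partials add no roughness; it is the closely spaced,
    non-coinciding partials that Helmholtz blamed for dissonance.
    """
    low = partials(f_low, 2 * n)      # enough low partials to span the upper tone's range
    high = partials(f_low * ratio, n)
    return sum(any(abs(h - l) < tol for l in low) for h in high)

for name, ratio in [("octave", 2.0), ("fifth", 3 / 2), ("major seventh", 15 / 8)]:
    print(f"{name}: {aligned_partials(220.0, ratio)} of 8 upper partials coincide")
# octave: 8 of 8, fifth: 4 of 8, major seventh: 1 of 8
```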
Helmholtz’s theory is also beautifully tested (and demonstrated) on an ingenious CD by Houtsma, Rossing and Wagenaars (1987). The demonstration is based on the idea that if both the intervals of a scale and the spacings of the initially harmonic partials are stretched to the same degree, partials of simultaneously sounding tones that coincided before stretching will coincide after stretching as well. Thus, the CD consists of several Bach chorales, each played four ways: normally; with both scale and partial spacing stretched to the same degree (the octave widened from a 2:1 to roughly a 2.1:1 ratio); with only the scale stretched; and with only the partials stretched. As Helmholtz’s theory would predict, the dually stretched versions sound perfectly harmonic (albeit a bit odd), while either singly stretched version sounds absolutely horrific. Pierce has also demonstrated that we can even apply Helmholtz’s ideas to the pleasantness of the timbre of a single musical tone (Pierce, 1990). If a tone has many strong successive partials, it is bound to sound harsh and buzzy because of the small spacing of the higher partials. (The Harmon-muted trumpet is a prime example of such a ‘dissonant’ timbre, and overtone spectrum plots show that nearly all successive overtones in the Harmon sound are approximately equally loud.)
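The arithmetic behind the trick is simple enough to verify directly. The sketch below (mine, using 2.1 as an illustrative stretch factor) defines stretched partials and confirms that a tone a stretched ‘octave’ higher still has every partial land on a partial of the lower tone:

```python
import math

def stretched_partials(f0, stretch=2.1, n=20):
    """Partials of a tone whose octave has been widened from 2:1 to stretch:1.

    The k-th partial sits at f0 * k**s with s = log2(stretch); when
    stretch == 2.0 this reduces to the ordinary harmonic series.
    """
    s = math.log2(stretch)
    return [f0 * k ** s for k in range(1, n + 1)]

f0, stretch = 220.0, 2.1
low = stretched_partials(f0, stretch)
high = stretched_partials(f0 * stretch, stretch, n=8)   # tone a stretched 'octave' above

# Because f0*stretch*k**s == f0*(2k)**s, upper partial k coincides with lower
# partial 2k -- stretching scale and timbre together preserves the alignment.
for k, h in enumerate(high, start=1):
    print(f"upper partial {k}: {h:8.1f} Hz   lower partial {2*k}: {low[2*k - 1]:8.1f} Hz")
```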
Thus, an investigation of acoustics has given us the first piece of the music enjoyment puzzle. We now know why consonant intervals sound consonant, and why certain timbres sound more pleasant than others. From here, we must leave the realm of pure sound and examine how sound interfaces with the human body, through the field of neurophysiology.
Neurophysiology
Open the auditory physiology section of most biology textbooks, and a well-developed model of sound processing is presented: sound is collected by the outer ear, amplified by the middle ear, translated from vibrational energy to electrical impulses by the cochlea, organized (according to pitch, location, intensity and a few other dimensions) in the cochlear nucleus, and then passed along in this organized form through several intermediary nuclei (superior olive, inferior colliculus and ventral medial geniculate) en route to the auditory cortex. While all of this may in fact be true, more recent research seems to indicate that such a picture of the auditory system is significantly oversimplified, and misses some of the more interesting aspects of hearing. The cause of this oversimplification seems to lie mainly in the technological limitations of the tools used in most early auditory physiology research. As one author lamented: “although one would like to know how musical stimuli are processed in the entire auditory system of waking, behaving humans, auditory physiology provides data mainly for the responses of single neurons to nonmusical stimuli in anesthetized nonhumans” (Weinberger, 1997). In recent years, with the emergence of more advanced imaging and detection tools, new research has provided some clues into the true complexity of the auditory process.
Systematic imaging studies, for example, have shown that neuronal discharge in the auditory cortex is highly dependent on task variables. That is, the response to the same physical acoustic stimulus is a function not only of the physical parameters of the stimulus, but also of whether or not the subject is selectively attending to the stimulus, and of which aspect of the stimulus the subject is attending to (Goldstein, Benson & Hienz, 1982). Further research has demonstrated these effects throughout the full auditory path, and on all major characteristics of neuronal response (probability of response, rate of evoked discharge, latency of discharge, and pattern of discharge) (Miller, Pfingst & Ryan, 1982). These findings raise a very interesting possibility of a feedback loop in listening to music. Imagine attending to a specific aspect of a song, say a single melodic line in a polyphonic piece. That attention would cause an increase in perception of certain aspects of the melodic line. If those aspects were then fed to higher level response mechanisms that derive a specific expectation from them (such mechanisms will be discussed in later sections), attention would be shifted to the point of the expected event, restarting the cycle. Thus, when people say that they feel ‘possessed by the music’ they might not be wholly incorrect; through such a feedback loop, music might, in fact, cause a chain of largely involuntary effects on both attentional focus and anticipation and release.
Studies have also demonstrated that learning seems to have an even more powerful role than attention in modulating neural response to sound. Most descriptions of the auditory path (perhaps excited by the parallel with very similar findings in the field of vision) make much of the highly organized nature of the sound maps assembled in the cochlear nucleus and preserved throughout the path. Various types of learning, however, seem to actually alter the nature and shape of such maps, changing the amount of representation that various frequencies receive (Weinberger et al., 1990). At the neuronal level, such frequency learning in auditory cells has been shown to be associative, rapidly developing, and extremely long-lasting (Weinberger, Ashe & Edeline, 1994), while at the whole-brain level, learning has been shown to increase metabolic activity in response to learned musical stimuli (Recanzone, Schreiner & Merzenich, 1993). Clearly, in light of both the learning and attention effects, the auditory system is considerably more complex and dynamic than it is often described.
The nature of the sound maps themselves is another area in which traditional interpretations of the auditory pathway seem a bit off. The most important aspect of such maps is their systematic organization with respect to acoustic frequency in the cochlea. The basal portions of the basilar membrane respond best to high frequencies and the apical parts of the membrane respond to low frequencies. There is then a point-to-point projection from the basilar membrane to the cochlear nucleus, where the frequency map is formed. Strictly speaking, this organization is ‘cochleotopic,’ and electrical stimulation of different parts of the basilar membrane was the first means by which this system of mapping was discovered (Walzl & Woolsey, 1943). Nevertheless, the organization has almost universally been referred to as ‘tonotopic,’ that is, ‘frequency-topic.’ The interesting question, however, is whether organization is truly by frequency, or rather by pitch. To explore this, several studies have taken advantage of an interesting phenomenon in human hearing that differentiates pitch from simple frequency, called periodicity pitch: normally, pitch is determined by the fundamental frequency (the bottommost partial) heard; if the fundamental is removed, however, and the higher remaining partials are played unchanged, the heard pitch is that of the missing fundamental. This phenomenon demonstrates that something more is happening in pitch perception than simple basilar membrane stimulation. (Humans aren’t the only species to recognize a difference between frequency and pitch. Through periodicity pitch, pitch recognition has also been demonstrated in cats [Chung & Colavita, 1976], birds [Cynx & Shapiro, 1986], and monkeys [Schwarz & Tomlinson, 1990]. This is particularly interesting considering the connection between upper partials and consonance and dissonance. The existence of periodicity pitch in non-humans seems to imply that our ability to experience consonance is a deeply ingrained, evolutionarily rooted biological phenomenon.) MEG scans (which measure the magnetic counterpart of the electrical activity recorded by EEG) during periodicity pitch tasks have shown that, at least in the auditory cortex, tonotopic map organization is driven by pitch rather than frequency (Pantev et al., 1989, 1991). Similarly, lower down in the auditory pathway and at the cellular level, single cells in the superior olivary nucleus exist that respond best to the combination of two pitches related by an approximately 3:2 ratio, the same ratio as seen between the second and third partials (Sutter & Schreiner, 1991; Schreiner & Sutter, 1992). Some research suggests that even at the level of the auditory nerve (the first stop of processed sound after the cochlea), pitch rather than frequency is coded in the distribution of spike intervals across the fibers of the auditory nerve (Cariani, Delgutte & Kiang, 1992; Cariani & Delgutte, 1996). Thus, music, rather than simply sound, seems to be ingrained rather deeply in our neural organization.
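The missing-fundamental effect is easy to reproduce computationally. The sketch below (my own, using a plain autocorrelation estimator as a stand-in for whatever the auditory system actually does) synthesizes a tone from only the 2nd through 5th partials of 200 Hz, yet still recovers a pitch of about 200 Hz:

```python
import numpy as np

def missing_fundamental_pitch(f0=200.0, partials=(2, 3, 4, 5), sr=16000, dur=0.2):
    """Estimate the pitch of a complex tone whose fundamental has been removed."""
    t = np.arange(int(sr * dur)) / sr
    x = sum(np.sin(2 * np.pi * k * f0 * t) for k in partials)   # no energy at f0 itself

    ac = np.correlate(x, x, mode="full")[len(x) - 1:]           # autocorrelation for lags >= 0
    lo, hi = int(sr / 500), int(sr / 50)                        # search pitches from 50 to 500 Hz
    best_lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / best_lag

print(missing_fundamental_pitch())   # ~200.0, even though no 200 Hz partial is present
```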
Support for the idea of such low-level instantiation of music comes from some very interesting research in cats, involving their response to the same note played within differing melodic contours (McKenna, Weinberger & Diamond, 1989). Several neurons tested in cats’ primary auditory cortices showed varying levels of response to the same pitch, depending on whether the pitch was played as part of an ascending, descending or monotone series of notes. This is suggestive, as it implies that even certain melodic mechanisms are inherent to brain structure at a fairly primitive level. A further study of cat brains has also demonstrated the complexity of lower level music processing as compared to other cognitive tasks (Espinosa & Gerstein, 1988). Recording a tone’s effects on an array of 24 simultaneously monitored neurons revealed complex connectivity patterns involving as many as all 24 of the recorded neurons. This is both remarkable and rather unprecedented, considering that studies of local networks of neurons have rarely shown such interactions among more than three or four neurons simultaneously.
Because of this, most researchers have attempted to study such networks of neurons through the computer modeling of neural networks. Various networks have successfully modeled everything from the division of pitches into semitones and pitch classes (Bharucha, 1987; Laden & Keefe, 1989; Leman, 1991; Todd, 1988) to the building of patterns over time as works of music unfold (Tekman & Bharucha, 1992; Todd, 1988, 1989). Comparisons between the predictions of such models and what we do know about music cognition in actual brains, however, cast serious doubt on neural networks (at least in their current forms) as a useful tool for elucidating the neurophysiology of the human mind (McNellis, 1993; Ford & Pylyshyn, 1996). Essentially, in their current form, neural networks may be able to describe an abstract logic for why music might sound good, but they don’t seem to reflect the structure and logic actually used in the brain.
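For reference, the abstract mapping those models learn to approximate, from frequency to equal-tempered semitone to pitch class, can be written down directly (a sketch of the computation itself, not of any of the cited connectionist architectures):

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def frequency_to_pitch_class(freq_hz, ref_a4=440.0):
    """Quantize a frequency into an equal-tempered semitone and its pitch class."""
    midi = round(69 + 12 * math.log2(freq_hz / ref_a4))   # nearest equal-tempered semitone
    return NOTE_NAMES[midi % 12], midi                     # pitch class repeats every 12 semitones

print(frequency_to_pitch_class(261.63))   # ('C', 60)  -- middle C
print(frequency_to_pitch_class(392.00))   # ('G', 67)
```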
Nonetheless, neurophysiology isn’t the only perspective from which we can successfully study brain function. Through the field of Behavioral Neuroscience, researchers have analyzed brain structure at a slightly higher level, focusing on whole nuclei and brain regions rather than individual neurons.
Behavioral Neuroscience
A mainstay of the field of behavioral or cognitive neuroscience has long been the case study of patients who have lost a specific ability. In the case of musical ability, such a loss is characterized as amusia, which is used to designate “acquired clinical disorders of music perception, performance, reading or writing that are due to brain damage and not attributable simply to the disruption of basic perceptual, motoric or cognitive function” (Prior, Kinsella & Giese, 1990). The study of various aphasias (losses of language) has yielded tremendous insight into the neurobiology of linguistic processes, and it is natural therefore to hope that studies of amusias will yield equally helpful results.
In fact, early case studies rarely differentiated between aphasia and amusia, as the two were so often seen together. This, in combination with the popular view of music as a sort of language (to be discussed in a later section), was sufficient to cause early researchers to suppose that language and music shared neuroanatomical substrates (Feuchtwanger, 1930). Newer research, however, has called this longstanding assumption into question, especially a recent investigation of changes in regional cerebral blood flow (CBF), measured by PET scan, during keyboard performance and score reading in pianists (Sergent et al., 1992). In the score reading task, along with the expected bilateral activation in extrastriate visual cortex (seen in any type of reading), a significant focus was observed in the left occipito-parietal junction. The researchers point out that activity in this region suggests a predominance of processing in the dorsal visual system, which is involved in spatial processing, as opposed to the ventral visual system, which is particularly important to the processing of words. Perhaps this shouldn’t be surprising, as reading music requires not only perceptual recognition of individual notation symbols (like notes, rests and expressive instructions), but also spatial discrimination of their positions on the staff. In the keyboard playing task, aside from the activation seen for hand motion, CBF was greatly changed in the left inferior frontal gyrus and left premotor cortex. While the authors explain these activations as possibly related to the translation of musical notation into spatial information and the organization of motor sequences for keyboard performance, these active regions, seen in both reading and playing, may also account for the often observed overlap between amusia and aphasia. Looking at the locations of the two strongest areas of CBF change, the inferior frontal gyrus activation is immediately adjacent to Broca’s area, while the occipito-parietal junction area is fairly close to Wernicke’s area; both Broca’s and Wernicke’s areas have been very well linked to roles in language processing and aphasia. As brain lesions are rarely precisely localized, amusia and aphasia may have seemed linked simply because lesions in amusia-related areas often also knocked out neighboring aphasia-related regions.
Supporting this ‘adjacent but separate’ view of musical processing, a recent review of several hundred aphasic and amusic case studies (Marin & Perry, 1997) found at least 13 documented cases of aphasia without amusia (including one prominent professor at the Moscow Conservatory who showed very severe aphasia, yet was still able to continue his work as a composer, creating works that received critical acclaim. [The professor is V. G. Shebalin (1902-63); his Fifth Symphony was praised by no less a figure than Dmitri Shostakovich as “a brilliant creative work” and “a creation of a great master.” The symphony was composed after he was brain damaged and became aphasic.]) and at least 20 cases of amusia with no aphasia. While such findings support the idea of a separation between the language and music faculties, other case studies seem to suggest that music itself is even further modularized. For example, patients have shown dissociation between automatic and propositional behavior in nearly all functions. In one not unusual case, a patient who was unable to spontaneously generate any speech or song would, after hearing the beginning of the Star-Spangled Banner, involuntarily join in, singing the entire national anthem with perfect lyrics, intonation, rhythm and prosody (Schwartz, Marin & Saffran, 1979). Another example of modularization can be seen in patients with musical anomia, who are unable to name heard songs or play named songs, but suffer no decrease in their musical abilities otherwise; several of these cases even existed without more general linguistic anomia (Crystal, Grober & Masur, 1989).
Similarly, despite the localization of certain musical processes to specific areas in the left hemisphere, music cognition as a whole appears to be significantly less unilateral than language cognition. Several studies of paired subjects with unilateral brain damage indicate that significant deficits in melody perception follow either right or left unilateral lesions (Zatorre, 1989; Prior, Kinsella & Giese, 1990). In particular, tone-sequence perception seems to depend heavily on various areas within the right hemisphere (Zatorre & Halpern, 1995; Samson & Zatorre, 1991). These findings have also been supported by various tests of musical ability during hemispheric anesthetization (with sodium amytal, the so-called Wada test) (Borchgrevink, 1991). For example, during anesthetization of the left hemisphere, both speech and singing are arrested, with both reappearing at about the same time. During anesthetization of the right hemisphere, however, singing is possible, though with a complete loss of pitch control, which returns only gradually.
A variety of very recent CBF studies have begun to place various musical modules in specific areas of the brain. Such tasks as musical working memory (Baddeley, 1992; Perry, 1994), ‘inner singing’ (Zatorre, Evans, Meyer & Gjedde, 1992), comparison of externally and self-generated tones (Perry et al., 1993), and various tasks performed by individuals with perfect pitch (Schlaug, Jancke, Huang & Steinmetz, 1995) have all produced consistent, distinctive CBF signatures in particular areas of the brain.
All of this, then, suggests that the full set of cognitive musical processes is quite complex, modularized within varying regions of the brain, and quite separate from the infrastructure for listening to, mentally manipulating and producing language. This is particularly important, considering that many authors have theorized that the strong parallels between music and language are due to music somehow ‘piggybacking’ on our neurological predisposition towards language. Although this appears not to be the case, strong parallels do exist between musical and linguistic structure, and exploring those parallels yields some of the deepest insights into what makes music sound good. We therefore turn next to the field of linguistics.
Linguistics
Well before Chomsky was applying reductive tree-type thinking to the structure of sentences, musicians were applying similar techniques to the structure of compositions. Around the turn of the century, an Austrian composer named Heinrich Schenker set out to develop a set of rules explaining how one might build a complex composition from a very simple melodic line (Schenker, 1935). This process of Auskomponierung (composing-out) essentially entailed a “simultaneous unfolding of vertical and horizontal dimensions,” and depended on Schenker’s proposed close interconnection between the structure of concurrent (harmonic) and successive (melodic) groupings of pitches. Schenker initially intended his system for building an Ursatz (a simple underlying melodic progression) into a final piece of music (as represented in a musical score) through various transformational Schichten (levels). He quickly discovered, however, that most music could be analyzed in the opposite direction as well. It seemed, in fact, that nearly all tonal music was reducible through Schenker’s approach.
Schenker’s diagrams look strikingly similar to Chomsky’s sentence diagrams, a fact not lost on musicians and linguists. In particular, linguist Ray Jackendoff and composer Fred Lerdahl collaborated to extend and clarify Schenker’s thinking based upon advances in understanding of linguistic structure (Jackendoff & Lerdahl, 1980), laying out an amazingly simple set of four rules which serve as a ‘grammar’ for constructing musical forms. In their simplest form, the rules explain four types of relationships: grouping structure, which hierarchically segments a piece into motives, sections and phrases; metrical structure, which relates pitches to ‘strong’ and ‘weak’ beats at several hierarchical levels; time-span reduction, which categorizes pitches hierarchically by their ‘structural importance’ in grouping and metrical structure; and prolongational structure, which assigns a hierarchy of harmonic and melodic tension and relaxation to the pitches.
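To give a feel for what these hierarchical descriptions look like in practice, here is a toy representation (my own illustrative sketch, not Lerdahl and Jackendoff’s actual formalism) of a grouped, metrically weighted phrase, with a crude head-finding rule standing in for time-span reduction:

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Note:
    pitch: str            # e.g. "C4"
    beat_strength: int    # metrical structure: higher = stronger beat

@dataclass
class Group:
    label: str                                  # grouping structure: motive, phrase, section...
    children: List[Union["Group", Note]] = field(default_factory=list)

    def head(self) -> Note:
        """A time-span-reduction-like notion of a group's structurally most
        important note, approximated here as its metrically strongest note."""
        notes = [c.head() if isinstance(c, Group) else c for c in self.children]
        return max(notes, key=lambda n: n.beat_strength)

# A tiny two-motive phrase; the phrase-level head falls on the downbeat C4.
phrase = Group("phrase", [
    Group("motive 1", [Note("C4", 3), Note("D4", 1), Note("E4", 2)]),
    Group("motive 2", [Note("F4", 1), Note("E4", 2), Note("D4", 1)]),
])
print(phrase.head())   # Note(pitch='C4', beat_strength=3)
```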
A key difference between generative linguistics and the Jackendoff/Lerdahl model is in the primary focus of the two systems. Generative linguistic models are primarily concerned with the grammaticality of various structures. Music theory, on the other hand, is more concerned with choosing the most aesthetic of a large number of competing ‘grammatically correct’ structures. Because of this, within each of Jackendoff and Lerdahl’s four rules are two types of sub-rule: well-formedness rules, which determine the ‘grammaticality’ of a musical structure, and preference rules, which determine the relative aesthetic appeal of competing structures. While the vast majority of the research in generative linguistics has focused on what might be called well-formedness rules, a smaller body of studies has shown that something similar to preference rules may also exist in areas of linguistics, such as pragmatics (Grice, 1975), the relative scope of quantifiers (Ioup, 1975), and phonetic perception (Liberman & Studdert-Kennedy, 1977).
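The division of labor is easy to caricature in code. In the sketch below (entirely my own toy example, with made-up scoring weights rather than Jackendoff and Lerdahl’s actual rules), a well-formedness constraint defines which segmentations of a short melody are admissible at all, and toy preference rules then pick the best one:

```python
# A melody as (pitch in semitones, inter-onset gap to the next note in beats).
melody = [(60, 1), (62, 1), (64, 2), (72, 1), (71, 1), (69, 4)]

def score_segmentation(boundaries):
    """Score a candidate grouping with two toy preference rules:
    prefer boundaries after long gaps and after large pitch leaps."""
    score = 0.0
    for b in boundaries:                                   # boundary falls after note index b
        pitch_leap = abs(melody[b + 1][0] - melody[b][0])
        score += melody[b][1] + 0.5 * pitch_leap
    return score

# Well-formedness (toy version): exactly one internal boundary,
# with each resulting group at least two notes long.
candidates = [(b,) for b in range(1, len(melody) - 2)]
best = max(candidates, key=score_segmentation)
print(best)   # (2,): the break falls after the long gap and before the big upward leap
```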
The Jackendoff/Lerdahl model is exciting, however, not simply because of its similarity to existing models of language, but rather because it provides perhaps the most credible and comprehensive view of why music is pleasurable: good music sounds good because it adheres to the structures proposed by the model, and therefore can be subconsciously parsed by various cognitive mechanisms. Essentially, then, the unfolding of this structure provides the ‘meaning’ of a piece of music. The musicologist L. B. Meyer (1973) points out that this meaning needn’t be designative meaning, referring programmatically or in some other way to objects or events outside of the musical domain. Rather, the meaning may simply be embodied meaning, providing significance to a listener through the interaction of that unfolding structure and the listener’s musical knowledge and expectations. By denying such expectations or by fulfilling them, either immediately or after drawing them out, a dynamic flux of tensions and resolutions can be created to influence the emotional and aesthetic response of the listener.
On the one hand, this brings us by far the closest we have yet come to explaining the pleasure of music. For the first time, we can see a potential connection between musical sounds and pleasure. On the other, such a theory of musical pleasure still seems a world apart from the more empirically concrete lower level neurophysiological and acoustic findings discussed to this point. For the highly abstract Jackendoff/Lerdahl model to hold water, it must somehow be tied to those lower level phenomena. The essential question, then, is: are there mechanisms innate to the physiology and structure of the brain that might explain grouping of the type Jackendoff and Lerdahl describe? Most fortunately, such a missing link has been found in the field of cognitive psychology, which has described, tested (and in some cases neurologically located) many of the cognitive grouping mechanisms that underlie such a structural approach. Let us therefore move on to the field of psychology.
Psychology
In the 1920s, a group of German psychologists, the Gestalt School, began to explore perception through the study of whole figures, rather than through the study of specific elements or dimensions of those figures, as done by psychophysicists and neurologists. In particular, the issue of how basic elements were grouped into such a whole was a key focus. While Gestalt Psychology fell out of vogue during the reign of Behaviorism, the field has recently experienced a second renaissance, through its role in helping to solve several key problems in computational vision. In that field, Gestalt principles have linked lower level research (at the level of spectral analysis) to higher level function (like shape detection and recognition). With Gestalt thinking having helped to create a more cohesive model of seeing, it is natural to hope that similar ideas might also help elucidate a full picture of hearing, and explain why music sounds good.
Gestalt principles have been shown to operate on a wide variety of aspects of auditory perception, such as how the sound spectrum is divided into individual sound sources (see Matthews & Pierce, 1980 for harmonicity, Darwin & Ciocca, 1992 for onset synchronicity, MacIntyre, Schumacher & Woodhouse, 1981 for frequency modulation and Bregman et al., 1985 for amplitude modulation), and how each of the resulting source-specific streams is segmented into individual notes (Deutsch, 1994, 1997). The focus of this section, however, will mainly be on how Gestalt principles combine those notes into the higher level groupings upon which the Jackendoff/Lerdahl theory depends.
The first of such principles is pitch proximity, an extension of the more general Gestalt principle of proximity, which essentially states that closer elements are grouped together over elements spaced further apart. In this row of characters, for example, the closer x’s are perceived as pairs:
xx   xx   xx
In a very similar way, notes close to each other in pitch are construed as part of a single melodic line. An empirical demonstration of this effect was seen in an experiment in which two well-known melodies were played by alternating between notes of the two melodies (Dowling, 1973). When the melodies were in overlapping pitch ranges, their components were perceptually combined into a single stream, and subjects had extreme difficulty identifying the two original melodies. When the alternating melodies were instead played in different pitch ranges, they were readily separated perceptually, and easily identified. (This effect is put to beautiful use in Arban’s Variations on the Carnival of Venice, a song near and dear to any trumpeter’s heart. In the last variation, the alternation between notes in the low and high registers of the horn gives the illusion of two separate trumpets playing simultaneously.)
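A greedy little heuristic (my own sketch, not Dowling’s procedure) captures the basic intuition: assign each incoming note to whichever existing stream it is nearest in pitch, and interleaved melodies separate when their registers differ but fuse when they overlap:

```python
def segregate_by_pitch(notes, max_leap=7):
    """Greedy stream assignment: attach each note to the stream whose last pitch
    is closest, starting a new stream if every leap exceeds max_leap semitones.
    A crude stand-in for grouping by pitch proximity."""
    streams = []
    for pitch in notes:
        candidates = [(abs(pitch - s[-1]), s) for s in streams]
        if candidates and min(candidates)[0] <= max_leap:
            min(candidates)[1].append(pitch)
        else:
            streams.append([pitch])
    return streams

# Two interleaved 'melodies' an octave apart separate cleanly...
print(segregate_by_pitch([60, 72, 62, 74, 64, 76]))   # [[60, 62, 64], [72, 74, 76]]
# ...but the same alternation in overlapping registers fuses into one stream.
print(segregate_by_pitch([60, 65, 62, 67, 64, 69]))   # [[60, 65, 62, 67, 64, 69]]
```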
While the principle of proximity operates on pitch, it also operates over time; in fact, early research demonstrated an interdependence between time and pitch (Schouten, 1962). Essentially, studies found that as the pitch separation between successive tones increased, it was necessary to reduce their presentation rate to maintain the impression of a connected series. We also use temporal proximity to divide groups of pitches by the pauses interspersed between notes. In one experiment, sequences of tones identical in frequency, amplitude and duration were separated by gaps of two alternating durations (Povel & Okkerman, 1981). Subjects perceived these sequences as repeating groups of two tones, segmented at the longer gaps (very much like the visual example presented above).
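The same grouping-by-gap idea is trivial to express as code; the sketch below (mine, with arbitrary timings) splits a tone sequence wherever the silence before the next onset is long:

```python
def group_by_gaps(onsets, threshold=0.3):
    """Split a sequence of onset times (in seconds) into groups wherever the
    silence before the next onset exceeds `threshold`: a toy version of
    grouping by temporal proximity."""
    groups, current = [], [onsets[0]]
    for prev, nxt in zip(onsets, onsets[1:]):
        if nxt - prev > threshold:
            groups.append(current)
            current = []
        current.append(nxt)
    groups.append(current)
    return groups

# Tones separated by alternating short (0.2 s) and long (0.5 s) gaps are
# heard -- and here grouped -- as pairs, like the pairs of x's above.
onsets = [0.0, 0.2, 0.7, 0.9, 1.4, 1.6]
print(group_by_gaps(onsets))   # [[0.0, 0.2], [0.7, 0.9], [1.4, 1.6]]
```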
Another Gestalt principle shown to operate on music is similarity; we tend to group together similar items. Here, we perceive two columns of x’s and two columns of o’s:
x o x o
x o x o
x o x o
x o x o
The timbre of a sound (its sound quality, or the specific balance of the various partials within it) has been demonstrated to be a dimension along which similarity can be judged. Even when the registers of two timbrally differing melodic lines heavily overlap (and might therefore be grouped into a single stream via proximity), the lines tend to be heard separately, due to the stronger grouping by timbre similarity within each melody. A striking demonstration of this grouping tendency is found in a study by Wessel (1979). He presented subjects with a repeated pattern consisting of a three-tone ascending pitch line, with successive tones composed of alternating timbres. When the timbral difference between successive tones was small, listeners heard the pattern as a repeated ascending line. When the timbral difference was large, however, listeners linked the tones by timbre rather than pitch, and heard two interwoven descending lines instead.
A third relevant Gestalt principle, good continuation, states that elements that follow each other in a given direction are perceptually linked together. Here, the symbols appear to make two crossing lines:
     <
     <
     <
< < < < < <
     <
     <
     <
Various studies using groups of three and four pitches (Divenyi & Hirsh, 1974; McNally & Handel, 1977) have demonstrated that sequences are much more likely to be judged as coherent when the tones move in a single direction than when they change direction.
A fourth Gestalt principle states that we tend to form groupings so as to perceive configurations that are familiar to us. While the groupings previously discussed apply equally to naïve and experienced listeners (an important point in support of the Jackendoff/Lerdahl model, which claims that structure is understood even by musically naïve individuals), the familiarity principle implies that some differences should be seen between listeners with various levels of musical expertise. This principle has been demonstrated more generally in many ways, perhaps most famously in studies of expert chess players; expert players show memory advantages in recalling viewed boards, but only when the boards come from actual games rather than being random collections of pieces (Chase & Simon, 1973; Canberg & Albert, 1988). Their advantage manifests itself in recall of groups of pieces, suggesting that expertise is associated with the development of a larger set of perceptual patterns for grouping incoming stimuli. Research in the realm of music has borne this out, showing that experts have strongly improved recall of genre-specific chunks (e.g., the ii-V-I chord progression), but not of more atypical musical stretches (Baddeley, 1992).
Some research also suggests that expert (or even moderately experienced) listeners not only group musical sounds differently, but might actually perceive those musical groups differently aesthetically. In one study, groups of nonmusicians and professional and amateur musicians were asked to judge whether two successive chords “sounded good” together. The chords were four-voice triads, either in the same tonality or in different tonalities, with no preceding key context, articulated according to classical harmonic techniques designed to effect the smoothest and most consonant transitions possible. Of the groups, nonmusicians rejected the most pairs of chords (around 25% of the pairs) and were also slowest in deciding whether the pairs sounded good. The amateurs rejected fewer chords, and were somewhat faster than the nonmusicians in deciding. The professionals rejected the fewest chord pairings, and responded considerably faster to all pairs. In justifying the “goodness” of a pair, professionals also tended to report factors other than simple tonal relationship, like the movement used in the outer chord voices, or whether such pairs might be found in Debussy or Chopin. Three of the professionals who reported having extensive experience with contemporary classical compositions were by far the least sensitive to tonal relationship, rejecting almost no pairs (Marin & Barnes, 1985).
These findings in Gestalt Psychology provide strong empirical support for the basic rules proposed in Jackendoff and Lerdahl’s model. Similarly, many of these principles are currently being traced to specific brain regions and modeled neurologically, providing continuity from the lower end as well.
Conclusions
From all of this, then, we can now postulate a very basic picture of why music sounds good. Certain aspects of musical sound are dictated by the nature of sound waves themselves; the remaining aspects cater to innate grouping mechanisms neurologically instantiated within the brain. These grouping mechanisms form a grammar of musical structure. Music that fits within that structure is perceived as good, and pleasure is derived from that goodness, largely through the tension and release dictated by the expectations that structure creates.
Clearly, while this gives us a simple sketch of how the process of musical enjoyment might work, a vast array of details remains. Hopefully, as some of the newer technologies used in brain research are applied to known phenomena in musical perception, stronger links will be formed between conceptual perception modules and the underlying mental substrate. Additionally, it seems inevitable that the field of psycho-musical research will increasingly gel, bringing together thinkers in various disciplines, and refining the techniques used experimentally in all of them. In current research, for example, there is very little consensus on establishing control levels of musical ability – there isn’t a sense of what constitutes basic musical competence, at least not in a sense comparable to Chomsky’s linguistic competence (Chomsky, 1970), which has been essential to much linguistic research.
Another important issue is the scope of the research presented. The vast majority of the research on musical cognition has focused on tonal, Western music. The first obvious question, then, is whether the findings in Western music (especially those presented as somehow innate) are also seen in other world musics. The very short answer is yes: most of the cognitive structure that seems to underlie music appears, to varying degrees, in nearly every culture studied. In fact, the differences seen between various musics in many ways parallel those seen between various world languages, which are similarly proposed to be differing instantiations of a single underlying structure. For a further discussion of various ethnomusicological questions, one might see the concurrent senior project of Andrew Gass, an anthropology major. The second big question is that of modern ‘atonal’ music, which doesn’t seem to fit in with the discussed findings. In part, it appears that appreciation of such music is mainly due to the great flexibility of the musical system, which gradually habituates to presented stimuli, even if they are at first perceived as dissonant. This was well demonstrated in the previously discussed study of chord pair ‘goodness’ judgments, where musicians with experience in modern music were considerably more willing to accept as ‘pretty’ pairs that untrained individuals felt didn’t go together. A second example of this type of effect is seen in a study of individuals classified as professionals, who were asked to sing resolutions to varying inversions. Musicians experienced with modern music tended to use less standard resolutions, including one “internationally known contemporary composer” who demonstrated “highly idiosyncratic” tone choices, rarely resolving to the tonic and dominant as the more traditional musicians preferred, and often resolving even non-diatonically (Marin & Barnes, 1995). Perhaps, in fact, a model of music cognition may help determine the limits to which tonality can be pushed while still being accepted as ‘musical’ by various audiences.
More importantly, an interdisciplinary model of musical enjoyment serves as a wonderful proof of concept for Cognitive Science as a field. It is certainly the expressed hope of most researchers who focus on a specific cognitive task that their research will somehow help to determine underlying structures and principles of brain function in general. This is also a basic hope of Cognitive Science as a whole – that bringing together thinkers in a variety of areas and focusing on specific issues is the best way to get a ‘foot in the door’ in understanding the extreme complexities of the human brain. The (relative) ease with which a picture of musical comprehension can be built upon existing thinking in vision, language and auditory perception augurs very well for a Cognitive Science approach to understanding the human mind.
Bibliography
Baddeley, A. (1992). Working Memory. Science, 255, 556-559.
Bharucha, J. J. (1987). MUSACT: A connectionist model of musical harmony. Proceedings of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Borchgrevink, H. M. (1991). Prosody, musical rhythm, tone pitch and response initiation during Amytal hemisphere anaesthesia. In J. Sundberg (Ed.), Music, language, speech and brain: proceedings of an International Symposium at the Wenner-Gren Center, Stockholm, 1990 (pp. 327-343). Cambridge, England: Macmillan Press.
Bregman, A. S., Abramson, J., Doehring, P. & Darwin, C. J. (1985). Spectral integration based on common amplitude modulation. Perception and Psychophysics, 37, 483-493.
Cariani, P. A. & Delgutte, B. (1996). Neural correlates of the pitch of complex tones: Pitch and pitch salience. Journal of Neurophysiology, 76, 698-716.
Cariani, P. A., Delgutte, B. & Kiang, N. Y. S. (1992). The pitch of complex sounds is simply coded in interspike interval distributions of auditory nerve fibers. Society for Neuroscience Abstracts, 18, 383.
Chase, W. G. & Simon, H. A. (1973). The mind’s eye in chess. In W. G. Chase (Ed.) Visual information processing. New York: Academic Press.
Chomsky, N. (1970). Remarks on nominalization. In R. A. Edwards & P. S. Rosenbaum (Eds.), Readings in English transformational grammar. City: Ginn and Company.
Chung, D.Y. & Colavita, F. B. (1976). Periodicity pitch perception and its upper frequency limit in cats. Perception & Psychophysics, 20, 433-437.
Crystal, H., Grober, E., & Masur, D. (1989). Preservation of musical memory in Alzheimer’s disease. Journal of Neurology, Neuroscience and Psychiatry, 52, 1415-1416.
Cynx, J. & Shapiro, M. (1986). Perception of missing fundamental by a species of songbird. Journal of Comparative Psychology, 100, 356-360.
Darwin, C. J. & Ciocca, V. (1992). Grouping in pitch perception: effects of onset asynchrony and ear of presentation on mistuned components. Journal of the Acoustical Society of America, 91, 3381-3390.
Deutsch, D. (1994). Pitch proximity in the grouping of simultaneous tones. Music Perception, 9, 185-198.
Deutsch, D. (1997). Grouping mechanisms in music. Music Perception, 24, 793-802.
Divenyi, P. L. & Hirsh, I. J. (1974). Identification of temporal order in three-tone sequences. Journal of Acoustical Society of America, 56, 144-151.
Dowling, W. J. (1973). The perception of interleaved melodies. Cognitive Psychology, 5, 322-337.
Espinosa, I. E. & Gerstein, G. L. (1988). Cortical auditory neuron interactions during presentation of 3-tone sequences: Effective connectivity. Brain Research, 450, 39-50.
Feuchtwanger, E. (1930). Amusie. Studien zur pathologischen Psychologie der akustischen Wahrnehmung und Vorstellung und ihrer Strukturgebiete besonders in Musik und Sprache. Berlin: Julius Springer.
Ford, K. & Pylyshyn, Z. W. (Eds.). (1996). The Robot’s Dilemma Revisited. Stamford, CT: Ablex Publishers.
Galileo Galilei. (1954) Dialogues concerning two new sciences. (H. Crew & A. de Salvio, Trans.) New York: Dover Publications. (Original work published 1637)
Gjerdingen, R. O. (1990). Categorization of musical patterns by self-organizing neuronlike networks. Music Perception, 8, 67-91.
Goldstein, M. H., Benson D. A., & Hienz R. D. (1982). Studies of auditory cortex in behaviorally trained monkeys. In C.D. Woody (Ed.), Conditioning representation of involved neural functions (pp. 307-317). New York: Plenum Press.
Grice, P. (1975). Logic and conversation. Syntax and Semantics, 3.
Gulick, W. L., Gescheider, G. A. & Frisina, R. D. (1989). Hearing. Oxford: Oxford University Press.
Helmholtz, H. L. F. On the sensations of tone as a physiological basis for the theory of music (A. J. Ellis, Trans.). New York: Dover. (Original work published 1877)
Houtsma, Rossing and Wagenaars. (1987). Auditory Demonstrations. IPO-NIU-ASA.
Ioup, G. (1975). Some universals for quantifier scope. Syntax and Semantics, 4.
Jackendoff, R. & Lerdahl, F. (1980). A Generative Theory of Tonal Music. Cambridge, Mass: MIT Press.
Krumhansl, C. L. (1990). Cognitive foundations of musical pitch. Oxford: Oxford University Press.
Laden, B. & Keefe, D. H. (1989). The representation of pitch in a neural net model of chord classification. Computer Music Journal, 13(4), 12-26
Leman, M. (1991). The ontogenesis of tonal semantics: results of a computer study. In P. Todd & G. Loy (Eds.), Connectionism and Music. Cambridge, MA: MIT Press.
Liberman, A. & Studdert-Kennedy, M. (1977). Phonetic perception. In R. Held (Ed.), Handbook of Sensory Physiology. Heidelberg: Springer.
MacIntyre, M. E., Schumacher, R. T., & Woodhouse, J. (1981). Aperiodicity in bowed string motion. Acustica, 50, 294-295.
Marin, O. S. M., & Perry, D. W. (1997). Neurological Aspects of Music Perception and Performance. In D. Deutsch (Ed.), The Psychology of Music. San Diego: Academic Press.
Matthews, M. V. & Pierce, J. R. (1980). Harmony and nonharmonic partials. Journal of the Acoustical Society of America, 68, 1252-1257
McKenna, T. M., Weinberger, N. M., & Diamond, D. M. (1989). Responses of single auditory cortical neurons to tone sequences. Brain Research, 481, 142-153.
McNally, K. A. & Handel, S. (1977). Effect of element composition on streaming and the ordering of repeating sequences. Journal of Experimental Psychology, 3, 451-460.
McNellis, M. (1993). Learning and recognition of relative auditory spectral patterns. Unpublished doctoral thesis, Dartmouth College, Hanover, NH.
Meyer, L. B. (1973). Emotion and Meaning in Music. Chicago: University of Chicago Press.
Miller, J. M., Pfingst B. E. & Ryan A. F. (1982). Behavioral modifications of response characteristics of cells in the auditory system. In C.D. Woody (Ed.), Conditioning representation of involved neural functions (pp. 307-317). New York: Plenum Press.
Pantev, C., Hoke, M., Lehnertz, K., Lütkenhöner, B. (1989). Tonotopic organization of the auditory cortex: pitch versus frequency representation. Science, 246(4929), 486-488.