
Article Summary

This paper explores the underlying mechanisms of sound perception, revealing that pitch is not a single frequency but is determined by the overtone structure. By analyzing the overtone overlap characteristics of an octave, it explains why notes at different pitches can be heard as "the same note" and demonstrates the source of stability in scale construction for the perfect fifth. The study indicates that the auditory system establishes a sense of sound unity by recognizing the proportional relationships of overtones, while the twelve-tone equal temperament, as an engineering compromise, balances the requirements of interval purity and scale closure. These findings provide a fundamental cognitive framework for understanding the principles of musical harmony, acoustic design, and performance practice, revealing the physical essence of sound stability and interval relationships.

Qwen3-14B · 2026-06-18

Contents

1. Why did I write this article?
2. Sound is not a single frequency.
3. Order among sounds: Perception from monotones to polytones
4. The History of Vocal Music and the Exploration of Scales
- 4.1 From Singing Practice to the Scale System
- 4.2 The auditory representation of major and minor keys
5. Summary and Reflection

1. Why did I write this article?

The purpose of this article is not to offer a vocal tutorial, nor a complete overview of music theory. It is more a record of my own thinking, an attempt to answer a question that is simple to ask and hard to answer: What am I listening to? And why do I hear these sounds the way I do?

If you're like me, curious about sound and music, then please join me on this journey of exploration.

In a previous article, to make the highly subjective judgment of "singing well" easier to debate and compare, I introduced the "three-dimensional theory" framework (see the article "Why does a song sound good? A three-dimensional theory analyzes the secrets of singing."). This framework is suited to analysis at the level of the whole song and the overall performance, where the result of a performance is analyzed and judged.

But as I continued to use it to break down specific performances, I gradually realized that the three-dimensional theory itself has very clear boundaries of application. Once we break things down further, for example by taking "sound" (the focus of the first dimension) and analyzing it separately, new questions immediately arise. Why do some people's voices sound "clean"? Why do some voices that sound irregular present a unique "messy beauty"? These questions clearly exceed the scope that the three-dimensional theory can explain.

Because at this point the discussion is no longer about "whether a song is sung well or not," but about something more fundamental: how is sound itself produced, perceived, and understood? In other words, the issue has shifted from "musical aesthetic evaluation" to acoustic structure and the mechanisms of auditory perception.

It was in the process of constantly asking these questions that I gradually realized my predicament did not truly belong to singing style or technique, nor even entirely to music theory. It pointed to a more fundamental, yet often overlooked, level: the physical structure of sound, and how humans understand "pitch," "interval," and "stability" at the auditory level. For example, when we say a sound is "clean" or "muddy," what exactly are we listening to? When two sounds have different pitches but are considered "the same note at different heights" (octaves), where does this judgment come from? Why do some intervals naturally feel stable and comfortable, while others are more likely to bring tension or even harshness?

These phenomena are not rules agreed upon after the fact; they are closely tied to the frequency structure and internal relationships of sound, and to the way the human auditory system works. The following discussion therefore deliberately avoids the common approach of memorizing rules and drilling skills. Instead, it tries to reinterpret familiar musical concepts along a more fundamental path: starting from the physical structure of sound → to the formation of auditory perception → and then to why intervals and scales are established the way they are today.

In this process, musical notation systems such as numbered musical notation, note names, and key signatures will no longer be regarded as the starting point for musical understanding, but rather as an abstraction and compression of complex acoustic reality based on long-term auditory experience.

2. Sound is not a single frequency.

2.1 Pitch is not a result of the perception of a single frequency.

Before discussing intervals, tonality, or scales, we almost always unconsciously accept a premise: a note corresponds to a pitch, and pitch is essentially a frequency. This premise seems very reasonable in the notation system: whether it is "1, 2, 3" in numbered musical notation or the position of notes on the staff, both constantly reinforce the intuition that music is composed of a series of clear and stable "points", each of which can be precisely marked, named, and reproduced.

But once you return to real auditory experience, this intuition immediately begins to waver: the same pitch sounds vastly different when played on a piano, played on a violin, or sung; the same note sung by different people can sound "stable" and "clean" from some and "floaty" and "scattered" from others; even when the pitch is perfectly accurate, we may still instinctively judge that the sound is "wrong".

If "pitch = frequency" were truly sufficient to describe what we hear, these differences would be difficult to explain, because in that model any sound with the correct frequency should sound like "the same note." Reality is clearly not like that. This is also an implicit confusion that many people encounter when learning music or singing: they know exactly which note they are singing, and by the standard of the score or the tuner they have not sung it wrong, yet their mental image of "what this note should actually sound like" remains vague. This vagueness is not a problem of comprehension. It stems from an oversimplified assumption: we think we are listening to "pitch", but in reality hearing never faces just an isolated frequency.

When a sound is produced, whether it's an instrument or a human voice, it's not a "point," but rather a whole unfolding across the frequency spectrum. Pitch is just the part of this whole that's easiest for us to name and for the notation system to capture, but it's not the whole.

This is precisely why we constantly use words unrelated to "frequency" when describing real listening experience: clean, muddy, thick, thin, bright, dark, sandy, rough... These words are not rhetorical devices; they are attempts to describe auditory differences that lie beyond a single pitch. Until we have a suitable concept for those differences, we can only point at them with our feelings.

Therefore, before continuing to discuss intervals, scales, and even tonality, there is one issue that must be addressed: if what we are hearing is not just "pitch," then what are we actually listening to?

2.2 Reasons for different listening experiences despite the same pitch

If the previous section addressed an intuitive question—that we are not just listening to pitch—then we must now confront a more realistic fact: even when the pitches are exactly the same, the differences between sounds are still real and significant.

This difference isn't an "advanced perception" that only emerges after musical training. Even without any music theory background, most people can immediately distinguish that some sounds are more "stable" while others are more "floaty"; some sounds are focused and clear, while others are blurry and difficult to grasp. More importantly, this judgment often occurs before "whether it's out of tune"—that is, before the pitch is clearly determined to be right or wrong, the ear has already made another layer of evaluation of the sound.

This illustrates that while we hear pitch, we are also receiving other information, and this information precisely determines the overall auditory experience. If "pitch = frequency" were sufficient to describe everything we hear, then as long as the frequencies are the same, there shouldn't be such a significant difference in auditory experience. Reality, however, repeatedly reminds us that the difference doesn't come from pitch itself, but from factors beyond pitch. In everyday discussions, people often use "different timbre" to summarize this difference, but this statement is more like a label of a result than an explanation—it doesn't answer a more fundamental question: if the pitch is the same, what exactly makes these sounds so different?

When two sounds are exactly the same in pitch, the difference perceived by hearing can only come from other features within the sounds. These features are not marked by musical notation or note names, but they always exist in reality during the sound production process: some sounds seem to be concentrated in a specific location with a clear overall outline; while other sounds seem diffuse with blurred boundaries, making it difficult to form a stable auditory image.

This difference is often intuitively described as "stable" or "unstable." The sense of stability here isn't a psychological feeling of peace or relaxation, but rather a very direct auditory experience: whether the sound presents a clear and sustainable overall form. When the internal changes of a sound are too frequent, with certain components fluctuating in strength and unpredictable, it becomes difficult for the ear to establish a stable perception of it. As a result, even if the pitch is correct, the sound may still be perceived as "unstable."

In contrast, when a sound exhibits a certain internal consistency in time and structure, the auditory system can quickly accept it and regard it as a complete sound unit. Even if the sound is not "beautiful" and even has a noticeable roughness, it may still be perceived as stable and reliable. This is why, in real-world listening experience, some sounds with a noticeable "sandy" or rough texture are more easily accepted than some superficially clean but erratic sounds.

This leads to another frequently mentioned but rarely thoroughly analyzed phenomenon—the perception of noise. As irregular components in a sound gradually increase, the perceived pitch often becomes blurred, the sound boundaries loosen, and the overall listening experience becomes more tiring. The problem isn't whether the sound "contains noise," but whether these irregular components begin to interfere with the auditory perception of the overall structure. Once this interference reaches a certain level, the auditory system struggles to reliably grasp the part that should serve as the anchor point, and the perception of pitch naturally wavers.

At this point, one thing becomes clear: the reason why the same pitch can sound completely different is not because we misheard the pitch, but because we are never just listening to pitch. The differences stem from the different internal structures of the sound, and these structural differences are the root cause of auditory perceptions such as "stable," "loose," "clean," and "rough."

This makes the question more specific. If these auditory differences are not subjective impressions, but rather stem from the actual internal structure of sound, then what exactly are these structures? And how do they work together to form the "whole sound" we perceive?

2.3 Overtones: The reason why sounds can be "heard as a single tone".

In the preceding discussion, we have repeatedly touched upon the fact that sound in reality is never a "single frequency." Whether it's a human voice, a piano, or a guitar, as long as it's not a pure sine wave synthesized electronically, it is necessarily the result of multiple frequencies existing simultaneously at a physical level. It's just that most of the time, we are not consciously aware of this.

This is where the problem begins. If every real sound contains a large number of different frequencies, how exactly does the human auditory system, in such a complex acoustic reality, still reliably determine "this is a sound," "how high it is," and "whether it is stable and pleasant to hear"? In other words, what the auditory system truly needs is never more detailed information, but rather a structure that can be uniformly understood—"overtones" are the core of this unified structure.

From a physics perspective, when a sound source vibrates at a fundamental frequency, it is almost impossible for it to undergo only a single, ideal vibration. The actual vibration naturally produces a series of additional vibrational components. The frequencies of these components do not appear randomly: in the idealized case they fall at integer multiples of the fundamental frequency, that is, 2 times, 3 times, 4 times, and so on. These frequency components, which coexist with the fundamental and appear according to this rule, are overtones.
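As a small illustration of the integer-multiple rule, the sketch below lists the first few components of an idealized tone. The function name `harmonic_series` and the choice of 440 Hz are only illustrative assumptions, and real instruments deviate slightly from exact multiples.

```python
def harmonic_series(fundamental_hz: float, count: int) -> list[float]:
    """Return the first `count` components of an ideal tone, fundamental included."""
    # Component n sits at n times the fundamental frequency.
    return [fundamental_hz * n for n in range(1, count + 1)]


# Hypothetical example tone: a fundamental of 440 Hz.
print(harmonic_series(440.0, 5))
# -> [440.0, 880.0, 1320.0, 1760.0, 2200.0]
```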

Any single tone we hear is actually a complex tone that includes the fundamental tone and overtones.

However, if we only stay at this definition level, overtones can easily be misunderstood as a "factor that complicates sound," as if pitches, which are already difficult to distinguish, are stuffed with a bunch of extra frequency information. But in reality, the existence of overtones is precisely the key to making the auditory world understandable.

Imagine an extreme case: if the frequency components within a sound are completely chaotic and irregular, then it will only be perceived as noise. It has no clear pitch center, cannot provide a sense of stability, and cannot be naturally incorporated into a musical system. The importance of overtones lies not in their "abundance," but in their "order."

This order is embodied in a very simple yet extremely crucial feature: all overtones are integer multiples of the fundamental frequency. It is this highly regular proportional relationship that allows the auditory system to compress and understand the originally complex set of frequencies as a whole. We do not perceive each frequency individually as "I am now hearing 440Hz, 880Hz, 1320Hz...", but rather form an intuitive judgment directly: this is a sound with a clear pitch and a stable internal structure.

Because the overtone structure is unified and predictable, the auditory system naturally anchors its attention to a particular "core," which is what we commonly refer to as pitch perception. In other words, pitch is not the result of a single frequency being "heard" in isolation, but rather the perceptual center to which the entire overtone structure points.

Looking back from this perspective, the meaning of our common saying "this is a complete sound" becomes much clearer. The so-called completeness does not mean that it only contains one frequency. On the contrary, it is because it contains a complete set of regular overtone structures that it sounds focused, stable, and solid.

This also explains an often overlooked fact: whether a sound is "pleasant to listen to" depends not on whether it is simple enough, but on whether it is unified enough. When overtones are clearly arranged and in stable proportions, the sound will be clean and focused; when overtones are present but randomly distributed, the sound will be rough and blurry; and when overtones are deliberately distorted or destroyed, the sound will have a strange, tense, or even distorted feel.

These differences are not merely aesthetic preferences, but rather the natural response of the auditory system to different internal structures. Therefore, when we truly place "overtones" back in their proper place, they cease to be an additional acoustic term and become a key to understanding auditory order. It is with this key that we can further understand why musical instruments, the source of modern music, can produce such a diverse range of sounds.

From a physical perspective, the sound produced by any musical instrument originates from vibration; as long as a stable pitch perception exists, it is inevitably accompanied by a complete set of overtone structures. Different musical instruments, due to their different sound production principles, vibrate in different ways, thus forming their own unique overtone distributions. These differences ultimately manifest as the differences in timbre that we perceive, and constitute the fundamental reason why "the same note can sound completely different."

In most musical practices, the concept of overtones is not explicitly understood. However, when we try to explain timbre differences, interval stability, or even why scales exist in their current form, overtones become an unavoidable underlying mechanism.

A brief explanation of the terminology is necessary here. The terms "fundamental tone" and "overtone" used earlier are more common in the field of music, while in physical acoustics the same phenomenon is usually described in terms of harmonics.

In physics, the lowest frequency produced by the vibration of an object is called the first harmonic, while the second, third, and later harmonics correspond to the higher multiples of that frequency. Because the two naming systems coexist, music distinguishes the "pitch center" from the "subordinate structures" more explicitly: the first harmonic in the physical sense is referred to separately as the fundamental tone, and the components from the second harmonic onward are called the first overtone, second overtone, and so on. The second harmonic (2f) is therefore the first overtone, the third harmonic (3f) is the second overtone, and the numbering continues with the same offset of one.
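The two numbering schemes can be lined up with a few lines of code. This is only a sketch of the naming offset described here; the helper `component_names` is a hypothetical name introduced for illustration.

```python
def component_names(harmonic_number: int) -> tuple[str, str]:
    """Map a multiple of the fundamental to its physics name and its music name."""
    physics = f"harmonic {harmonic_number}"
    # In the music convention the first component is the fundamental,
    # and overtone numbering starts one step later.
    if harmonic_number == 1:
        music = "fundamental"
    else:
        music = f"overtone {harmonic_number - 1}"
    return physics, music


for n in range(1, 5):
    physics, music = component_names(n)
    print(f"{n} x f: {physics:<12} {music}")
```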

Regardless of the naming convention used, they all refer to the same objectively existing frequency structure. Once this is understood, the pitch perception, stability, and "why a sound is perceived as a single tone" discussed earlier are no longer abstract or artificially constructed concepts, but rather natural results built upon this ordered vibrational structure.

3. Order among sounds: Perception from monotones to polytones

3.1 Why is an octave perceived as "different pitches of the same note"?

In everyday musical experience, there is a phenomenon that almost everyone accepts by default but rarely questions: when the frequency of a note doubles, for example from 440 Hz to 880 Hz, and then doubles again to 1760 Hz, we say without hesitation, "It is the same note, just an octave higher."

To make the auditory concept of an octave more intuitive on a physical level, the following diagram shows the actual frequencies corresponding to the note names (C to B) in the twelve-tone equal temperament at different octaves:
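The frequencies in such a table can be recomputed with a short sketch. It assumes the common reference A4 = 440 Hz, which the article does not state explicitly but which agrees with the value of about 261.63 Hz it gives for C4; the names `NOTE_NAMES`, `A4_HZ`, and `note_frequency` are illustrative.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
A4_HZ = 440.0  # assumed reference pitch, not taken from the article


def note_frequency(name: str, octave: int) -> float:
    """Equal-tempered frequency of a note such as ("C", 4)."""
    # Count semitone steps from A4, then apply one twelfth of an octave per step.
    semitones_from_a4 = (octave - 4) * 12 + NOTE_NAMES.index(name) - NOTE_NAMES.index("A")
    return A4_HZ * 2 ** (semitones_from_a4 / 12)


# One row per octave; each column doubles from one row to the next.
for octave in range(2, 6):
    print("  ".join(f"{name}{octave}={note_frequency(name, octave):7.2f}" for name in NOTE_NAMES))
```

Reading down any column of the output shows the doubling pattern: every C, for instance, is twice the frequency of the C in the row above it.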

This judgment seems like a rule of music theory, as if it were artificially prescribed. But if we go back to the previous discussion about overtones, we will find that this auditory "classification" actually has a very clear physical and perceptual basis.

Let's first look at the internal structure of sound. Suppose there is a sound with a fundamental frequency of f. Its overtones will naturally be arranged at integer multiples: 2f, 3f, 4f, and so on (a point made repeatedly in the previous section). When this sound is raised by an octave and the fundamental frequency becomes 2f, its overtone structure is not "reshuffled". Instead, it shifts upwards as a whole: the new overtones appear at 4f, 6f, 8f, and so on.

The key point here is that the overtones of the original sound overlap heavily with those of the new sound. The original 2f becomes the new fundamental; the original 4f is the new second harmonic (its first overtone); and the original 8f is still present in the new structure. In fact, every component of the higher sound already exists in the lower one. From a spectral perspective, these two sounds are not two unfamiliar objects, but the same type of structure: highly similar in internal composition and differing only in overall scale.

It is this highly consistent arrangement of overtones that allows the human auditory system to make a very stable judgment: "They belong to the same pitch category, just at different pitch levels." From an auditory perception perspective, the human ear doesn't analyze each specific frequency individually; instead, it quickly grasps the stable patterns within the structure. When the overtone structures of two sounds are highly consistent in proportion, the auditory system automatically classifies them as the same category, adding a simple dimensional difference—high or low.

This also explains an interesting phenomenon: the octave is special not because the mathematical relationship of frequency doubling is particularly mysterious, but because, of all intervals, only the octave preserves the overall shape of the original overtone structure to the greatest extent. Other intervals, even if they also follow simple integer ratios, inevitably introduce more non-overlapping frequency components, creating a more obvious sense of "separation" in the listening experience.
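The difference in overlap can be made concrete with a rough sketch. It treats both tones as ideal harmonic series and asks what share of the upper tone's components land exactly on a component of the lower tone. The function `shared_fraction`, the limit of 24 components, and the just-intonation ratios 4/3 and 5/4 are assumptions added for illustration; this counts coincidences only and is not a model of hearing.

```python
from fractions import Fraction


def shared_fraction(ratio: Fraction, count: int = 24) -> Fraction:
    """Share of the upper tone's first `count` harmonics that coincide
    with some harmonic of the lower tone (lower fundamental taken as 1)."""
    # Harmonic n of the upper tone sits at n * ratio; it coincides with a
    # harmonic of the lower tone exactly when that value is a whole number.
    hits = sum(1 for n in range(1, count + 1) if (n * ratio).denominator == 1)
    return Fraction(hits, count)


intervals = {
    "octave": Fraction(2, 1),
    "perfect fifth": Fraction(3, 2),
    "perfect fourth": Fraction(4, 3),
    "major third": Fraction(5, 4),
}
for name, ratio in intervals.items():
    print(f"{name:<15} ratio {ratio}  shared {shared_fraction(ratio)}")
```

Under these assumptions the octave shares every component, the fifth every second one, the fourth every third, and the major third every fourth, which matches the ordering of "closeness" described in the text.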

Therefore, when we say "an octave is different pitches of the same note," we are not using a metaphor, but describing a real auditory compression effect. When faced with two highly similar overtone structures, the auditory system chooses to ignore the differences in absolute frequencies and emphasize the unity of the structure itself.

Looking back at the design of "note names" from this perspective, it all makes perfect sense. The reason why C2, C3, C4, C5... share the same note name is not to simplify notation, but because they are perceived as different ways of unfolding the same type of sound at the auditory level.

To go further, the octave is not the starting point of the music system, but an answer actively provided by the auditory system. Before humans began constructing musical scales and naming pitches, hearing had already categorized the structure of sounds. Music theory simply followed this existing perceptual pathway, fixing, abstracting, and recording it.

Once this is understood, the "octave sense" is no longer a rule that needs to be memorized, but a phenomenon that can be intuitively understood: when two sounds are sufficiently unified in their internal structure, humans will naturally hear them as "the same sound".

3.2 When multiple sounds occur simultaneously: How does hearing establish a sense of unity?

In real musical settings, it's almost impossible to hear only a single, isolated sound. Whether it's listening to a song, playing a piano piece, or simply using a guitar to accompany vocals, multiple sounds appearing simultaneously is the more common and natural state in music.

When multiple sounds are heard simultaneously, the question facing the auditory system is no longer simply "what kind of sound is this." It shifts subtly: will these sounds merge together, or will they interfere with and pull at each other?

At the physical level, this means multiple complete frequency structures are superimposed on the same timeline: each sound contains its own fundamental tone and a whole set of overtones, and when they exist simultaneously, these frequency components act together on the auditory system. In terms of sheer quantity, this is clearly more complex than a single tone. But for hearing, the real challenge isn't "more frequencies"; it is whether these frequencies can be organized into a comprehensible whole.

The auditory system does not analyze these sounds one by one as completely independent information sources. It quickly determines whether there is some relationship between them that can be unified. The core of this judgment lies not in a specific frequency, but in whether there are regular correspondences between the multiple sets of overtone structures.

When a clear and stable proportional relationship can be found between the overtone structures of multiple sounds (such as a 2:1 ratio between octaves, or the frequency relationship of thirds and fifths in common chords), the auditory system tends to perceive them as different components of the same whole. Even if these sounds are not exactly the same in pitch, they will sound natural and stable, as if they "should exist simultaneously." This proportional relationship, like the integer multiples between overtones, provides the auditory system with a reliable structural anchor.

Conversely, if these overtone structures lack such correspondence, it becomes difficult for the ear to establish a unified framework for understanding. In this case, the sounds don't truly blend, but present a sense of separation, tension, or even chaos. In other words, when multiple sounds occur simultaneously, the human ear isn't simply "hearing many sounds"; it is subconsciously checking whether these structures can be integrated into the same perceptual system.

This comparison is completed almost instantaneously. The auditory system automatically searches for alignable frequency components and simple, stable proportional relationships. As long as enough anchor points are found, the auditory perception tends to stabilize; once these anchor points are insufficient, tension arises.

From this perspective, intervals are not simply abstract distances between pitches, but expressions of the similarity and coordination among multiple overtone structures. The so-called "harmony" and "conflict" are essentially direct feedback from the auditory system on this structural relationship. It is under such a comparative mechanism that humans have gradually developed a stable intuition for certain intervals. These intervals appear frequently in music not by accident, but because their overtone structures are easily integrated by the auditory system and readily form a sense of unity.

Once you understand this, when you observe the chords and intervals that are frequently used in music, you will find that they are special not because they are artificially defined, but because they happen to fit the positions where the ear can most easily achieve harmony—this is the root of musical harmony.

3.3 Fifths and Scales: How Proportional Relationships Construct the Musical Framework

In the previous section, we discussed how the auditory system seeks clear and stable proportional relationships when multiple sounds occur simultaneously, integrating complex vibrations into a comprehensible whole. Among all interval relationships, the fifth is the stable proportion that is most typical and most easily perceived in musical practice.

The so-called perfect fifth (pure fifth) refers to a very simple and stable frequency relationship between two pitches: the frequency of the upper note is approximately 3/2 times that of the lower note. In other words, if the fundamental frequency of a certain note is f, then the frequency of the note a perfect fifth above it is approximately f × 3/2.

This ratio is not arbitrarily defined; it follows directly from the overtone structure of sound. Take C and G as an example. In the harmonic series of C, the third harmonic (3f) is a G: it lies exactly one octave above the G that sits a fifth over the fundamental, since 3f = 2 × (3/2)f. Conversely, every second harmonic of that G (at 3f, 6f, 9f, and so on) lands on a harmonic of C. Because the two sets of overtone structures overlap and match to such a large degree, the auditory system hardly needs any additional "calculation" when processing these two sounds and can naturally integrate them into a stable and harmonious whole.

In other words, a fifth sounds "stable" not because we have studied music theory, but because, at the level of overtones, it is inherently one of the structural relationships most easily understood by the auditory system.

This relationship becomes even more intuitive when viewed in terms of actual frequencies. Take the common middle C (C4) as an example; its frequency is approximately 261.63 Hz. A perfect fifth above it corresponds to 261.63 × 3 / 2 ≈ 392.4 Hz. This value lies very close to G4 (approximately 392.0 Hz in equal temperament). In other words, the pair C4 → G4 is naturally close to a 3:2 relationship in physical frequency, rather than being the result of being "tuned" later.

Following the same logic, stacking perfect fifths upwards produces a continuous chain of fifths: C → G → D → A → E → B → …

In this chain, each step has the same meaning: the frequency of the next note is approximately 3/2 of the previous note. For example, D is a perfect fifth above G, and A is a perfect fifth above D. If you refer to the frequency table of the notes listed above and do some simple calculations, you will find that the relationships between these notes are numerically continuous and consistent, rather than scattered or accidental.

It's important to note that this logic of "consecutive fifths" does not require all pitches to fall within the same octave. In actual music, pitches are often brought back into a convenient range by raising or lowering them by octaves, but when the auditory system judges fifth relationships, it attends to how the proportions of the overtones match, not to the absolute height of the frequencies. Therefore, even if two notes are separated by one or more additional octaves, as long as their frequency ratio reduces to roughly 3/2 once those octaves are removed, the stability of a fifth can still be clearly perceived.
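The chain of fifths and the octave folding described above can be sketched as follows. The helper names `fold_into_octave` and `chain_of_fifths` are hypothetical, the fifths are taken as exactly 3/2, and the starting frequency uses the article's value of about 261.63 Hz for C4.

```python
from fractions import Fraction

PURE_FIFTH = Fraction(3, 2)


def fold_into_octave(ratio: Fraction) -> Fraction:
    """Halve or double a ratio until it lies in the range [1, 2)."""
    while ratio >= 2:
        ratio /= 2
    while ratio < 1:
        ratio *= 2
    return ratio


def chain_of_fifths(steps: int) -> list[Fraction]:
    """Stack pure fifths, folding each result back into one octave."""
    ratios = [Fraction(1)]
    for _ in range(steps):
        ratios.append(fold_into_octave(ratios[-1] * PURE_FIFTH))
    return ratios


C4_HZ = 261.63  # approximate value quoted in the article
for name, ratio in zip(["C", "G", "D", "A", "E", "B"], chain_of_fifths(5)):
    print(f"{name}: ratio {ratio} -> {C4_HZ * float(ratio):.2f} Hz")
```

The folded ratios come out as 1, 3/2, 9/8, 27/16, 81/64, and 243/128. Because these are pure 3/2 steps, the resulting frequencies differ slightly from the equal-tempered values in a standard frequency table.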

From this perspective, the fifth is not merely an interval name, but a fundamental logic that runs through the construction of single notes, multiple notes, and even scales. It is this highly stable and easily integrated proportional relationship within the overtone structure that makes the fifth an irreplaceable basic piece of the musical puzzle, providing a clear and natural starting point for the formation of subsequent scale systems.

Incidentally, the circle of fifths describes the auditory proximity or distance between pitches: C and G, or G and D, are naturally close in sound. The piano keyboard, however, does not arrange pitches in this auditory order. It projects the note names (C, D, E, F, G, A, B) onto a linear key layout in a way that suits two-handed playing and memorization. The keyboard takes the seven most frequently used and most easily integrated note names as the framework of the white keys and inserts the remaining pitches as black keys, satisfying both playability and scale completeness within a limited physical space.

While seemingly conflicting, these two arrangements actually serve entirely different cognitive and practical needs. It was within this tension between auditory structure and expressive tools, as humanity attempted to construct a complete and closed scale system, that new problems truly emerged.

Additional knowledge: Twelve-tone equal temperament and fifth overflow

In the previous section, we discussed how the superposition of consecutive fifths naturally constructs a stable pitch sequence. Each superposition of a perfect fifth results in a highly harmonious interval; the matching between overtone structures makes this relationship almost imperceptible and acceptable without learning. From the perspective of individual intervals, this chain of fifths is almost "perfect."

But that's precisely where the problem lies: this perfect chain of fifths cannot naturally close into a finite scale system. If we start from C and repeatedly add perfect fifths upwards, we get the following sequence: C → G → D → A → E → B → F# → C# → G# → D# → A# → F → C′ (a higher C, once each note is folded back by octaves). Here F and C′ are written with their keyboard names; strictly speaking, a pure fifth above A# is E#, and the one after that is B#.

Throughout this process, the logic of each step remains consistent: the frequency of the next note is approximately 3/2 of the previous note, and audibly, these adjacent notes maintain a familiar and stable fifth interval relationship.

However, after we complete twelve such superpositions, the theoretical return to that... An octave higher, C (C′),existThe frequency does not perfectly coincide with the exact octave of the starting C.This phenomenon, where "it's just a little bit off, but it just won't align," is called...Fifth degree overflow (Pythagorean comma)It reveals a very crucial fact:A chain of fifths based solely on perfect proportions cannot form a self-consistent closed loop within a finite musical scale.:
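The size of this gap can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming pure 3:2 fifths and comparing twelve of them against seven exact octaves; the function names and the use of cents (1200 per octave) as a unit are illustrative choices, not something the article prescribes.

```python
from fractions import Fraction
import math

PURE_FIFTH = Fraction(3, 2)   # frequency ratio of a pure perfect fifth
OCTAVE = Fraction(2, 1)       # frequency ratio of an octave


def stack_fifths(count):
    """Frequency ratio reached after stacking `count` pure fifths upwards."""
    return PURE_FIFTH ** count


def cents(ratio):
    """Size of a frequency ratio in cents (1200 cents = one octave)."""
    return 1200 * math.log2(float(ratio))


twelve_fifths = stack_fifths(12)        # 531441/4096, a little above 128
seven_octaves = OCTAVE ** 7             # exactly 128
comma = twelve_fifths / seven_octaves   # the leftover "overflow"

print(f"twelve fifths : {float(twelve_fifths):.4f}")
print(f"seven octaves : {float(seven_octaves):.4f}")
print(f"overflow ratio: {comma} (about {cents(comma):.2f} cents)")
```

The leftover ratio is 531441/524288, roughly a quarter of an equal-tempered semitone: small, but never zero, because no power of 3 equals a power of 2.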

In other words, if we want to preserve the stability of the fifth while also allowing the scale to cycle completely within an octave, we must make a compromise somewhere.

The twelve-tone equal temperament is a compromise born of such practical constraints. It no longer attempts to keep the "purest" integer ratio for each interval, but instead starts from the whole: divide an octave (the interval over which the frequency doubles) into twelve equal steps, each with the same frequency ratio. This allows us to return exactly to the octave after twelve identical steps, starting from any note. This can be represented by a clockwise circle, as shown in the diagram below:

The cost of this is obvious: in the twelve-tone equal temperament, the fifth is no longer a perfectly pure 3:2, and the third no longer perfectly matches the proportions of natural overtones. Every interval has been "slightly adjusted." But these deviations are evenly distributed throughout the scale, small enough that the auditory system can still smoothly integrate the overtone structure without producing noticeable discomfort or conflict.
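To put numbers on these "slight adjustments," here is a small sketch that compares equal-tempered intervals with their pure counterparts. It assumes the pure fifth is 3:2 and the pure major third is 5:4, and it measures the difference in cents; the helper names are hypothetical.

```python
import math

SEMITONE = 2 ** (1 / 12)   # one equal-tempered step: twelve of them double the frequency


def equal_tempered(semitones):
    """Frequency ratio of an interval spanning `semitones` equal-tempered steps."""
    return SEMITONE ** semitones


def cents(ratio):
    """Size of a frequency ratio in cents (1200 cents = one octave)."""
    return 1200 * math.log2(ratio)


def deviation_cents(semitones, pure_ratio):
    """How far the equal-tempered interval sits from the pure ratio, in cents."""
    return cents(equal_tempered(semitones)) - cents(pure_ratio)


# (name, number of semitones, pure ratio assumed for comparison)
INTERVALS = [
    ("perfect fifth", 7, 3 / 2),
    ("major third", 4, 5 / 4),
]

for name, semitones, pure in INTERVALS:
    tempered = equal_tempered(semitones)
    print(f"{name}: tempered {tempered:.6f}, pure {pure:.6f}, "
          f"deviation {deviation_cents(semitones, pure):+.3f} cents")
```

Under these assumptions the tempered fifth comes out about 2 cents narrow, while the tempered major third is roughly 14 cents wide, which is why the fifth survives the compromise far better than the third.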

From an auditory perspective, this is a very clever engineering compromise. The octave relationships remain exact; every fifth is still very close to the ideal ratio; and the entire scale system gains unprecedented unity and flexibility.

For this reason, the twelve-tone equal temperament was not designed to be "purer," but rather to be "more usable." It allows music to modulate freely, keeps keyboard instruments playable in all keys, and provides a stable foundation for complex harmonies and polyphonic writing.

From this perspective, the twelve-tone equal temperament is not a denial of natural proportions, but a rational choice made after acknowledging the limitations of reality. It moderately connects humanity's auditory preference for harmonious intervals with the need for unity and operability in musical practice, thus laying the foundation for the modern music framework we are familiar with today.

3.4 The Auditory Logic from Octaves to Chords

In the previous sections, we focused on the special characteristics of octaves and fifths: their frequencies stand in very simple integer ratios, so their overtone structures are easily integrated into stable and natural intervals by the auditory system. In addition to octaves and fifths, there are many other commonly used intervals in music, such as fourths and major and minor thirds, each with its own characteristics in overtone matching.

Taking a fourth as an example, its pure frequency ratio is 4:3, while a third is 5:4 (major third) or 6:5 (minor third). Although these ratios are not as simple as those of octaves or fifths, they are still ratios of small integers. Aurally, this means that some coinciding points can still be found between their overtone sequences, producing a certain sense of stability and harmony. It is this degree of matching that allows fourths and thirds to appear naturally in harmonic construction and melodic progression without causing obvious tension or chaos.
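The idea of "coinciding points" between overtone sequences can be made concrete. The sketch below assumes idealized tones whose overtones are exact integer multiples of the fundamental, and it counts where the first twelve harmonics of two notes land on the same frequency; the function name and the limit of twelve harmonics are arbitrary choices for illustration.

```python
from fractions import Fraction


def shared_harmonics(ratio, limit=12):
    """List (lower_harmonic, upper_harmonic) pairs that share a frequency.

    `ratio` is the upper note's fundamental divided by the lower note's.
    Only the first `limit` harmonics of each note are considered.
    """
    pairs = []
    for n_upper in range(1, limit + 1):
        # Frequency of this upper-note harmonic, in units of the lower fundamental.
        position = ratio * n_upper
        if position.denominator == 1 and position <= limit:
            pairs.append((int(position), n_upper))
    return pairs


INTERVALS = {
    "octave 2:1": Fraction(2, 1),
    "fifth 3:2": Fraction(3, 2),
    "fourth 4:3": Fraction(4, 3),
    "major third 5:4": Fraction(5, 4),
    "minor third 6:5": Fraction(6, 5),
}

for name, ratio in INTERVALS.items():
    pairs = shared_harmonics(ratio)
    print(f"{name}: {len(pairs)} shared harmonics {pairs}")
```

In this simplified model the count of shared harmonics falls as the ratio gets more complex (six for the octave, four for the fifth, three for the fourth, two for each third), which mirrors the ordering of stability described above. Real instrument tones are less tidy, so treat this as an illustration rather than a model of hearing.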

Looking further, the perception of harmony and chords is essentially an intuitive judgment of matching multiple sets of overtone structures. When multiple notes sound together, the auditory system automatically seeks integer ratios between frequencies to form a unified perceptual whole. The simpler the ratio and the clearer the matching, the more stable and harmonious the sound; when the ratio is complex or there is severe overtone conflict, the sound will seem tense or unstable. This is why certain chords are widely used in different cultures and musical systems, while some extreme interval combinations can sound "harsh" or difficult to integrate into the overall sound.

In conclusion, octaves and fifths are merely the most extreme and intuitive examples in the world of intervals, making it easy for us to understand how hearing integrates overtones. The existence of other intervals demonstrates the flexibility of the auditory system when faced with complex sounds: it does not require perfectly regular integer multiples, but rather relies on relatively clear proportional relationships to form understandable harmonic structures. This mechanism not only explains the stability of commonly used intervals but also provides the physiological and physical basis for the perception of chords, modes, and melodies.

4. The History of Vocal Music and the Exploration of Scales

4.1 From Singing Practice to the Scale System

In early human vocal practice, there were no fixed note names, scales, or systematic theoretical descriptions. Whether singing or using instruments, the core criterion remained the same: does it sound right to the ear? Through long-term practice, people gradually discovered that certain interval combinations sound stable and natural and are easily accepted and remembered, while other combinations bring a sense of tension, discomfort, or even harshness. This seemingly "empirical" judgment is not purely an aesthetic preference, but is closely related to how the human auditory system perceives the internal structure of sound.

From the perspective of auditory mechanisms, when the frequency relationship between two sounds forms a relatively simple and stable ratio, the overtone structures they produce are more easily integrated into a whole by the human ear, creating a "harmonious" and "stable" auditory experience. Conversely, combinations with complex ratios and more conflicting overtones are more likely to induce feelings of tension and instability. It is through this repeated auditory feedback that early vocal practice gradually developed a set of unnamed yet highly consistent criteria for judging what kind of sound is "correct" and what kind of combination is "harmonious."

At this stage, the transmission of music did not rely on abstract rules or explicit concepts, but rested on auditory consensus and bodily imitation. The singer did not explain "this is a certain interval" or "this conforms to a certain proportion," but rather demonstrated, sang along, and repeatedly corrected, gradually bringing the sound closer to a stable result commonly accepted by the group. It is precisely because the human auditory system is highly consistent in judging stable overtone structures that this oral method of transmission, centered on auditory perception, could continue effectively for a long time without theoretical support.

The ancient Chinese pentatonic system (gōng, shāng, jué, zhǐ, yǔ: the names of its five notes, equivalent to 1, 2, 3, 5, and 6 in modern numbered musical notation) is the result of long-term refinement of vocal practice centered on auditory stability.

A seemingly simple yet crucial fact is this: the system retains only scale degrees 1, 2, 3, 5, and 6, while deliberately omitting 4 and 7. This is not because the ancients "did not know" these two notes, but because, in vocal practice where auditory stability is the primary goal, they are not the optimal choice.

From the perspective of the physical structure of sound, this selection process is actually quite natural. Early pitch relationships were not built from abstract scales, but grew out of the continuous stacking and comparison of pure fifths (frequency ratio 3:2). When a note is repeatedly stacked upwards or downwards by a perfect fifth, the first pitches generated are highly matched in their overtone structure and blend together easily. Compressing these notes back into the same octave naturally yields a stable set of notes: 1, 2, 3, 5, 6.
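This stack-then-fold procedure is easy to sketch. The version below is a simplification that stacks pure 3:2 fifths upwards only, starting from the tonic, and halves each result until it falls inside one octave; the function names and the mapping to numbered notation are illustrative assumptions.

```python
from fractions import Fraction

PURE_FIFTH = Fraction(3, 2)


def fold_into_octave(ratio):
    """Halve or double a ratio until it lies in the range [1, 2)."""
    while ratio >= 2:
        ratio /= 2
    while ratio < 1:
        ratio *= 2
    return ratio


def stacked_fifths_scale(count):
    """Stack `count` notes by pure fifths (tonic included), fold, and sort."""
    ratios = [fold_into_octave(PURE_FIFTH ** k) for k in range(count)]
    return sorted(ratios)


# Numbered-notation degrees of the five notes, in ascending order.
DEGREES = ["1", "2", "3", "5", "6"]

for degree, ratio in zip(DEGREES, stacked_fifths_scale(5)):
    print(f"degree {degree}: ratio {ratio} = {float(ratio):.4f}")
```

The five ratios come out as 1, 9/8, 81/64, 3/2, and 27/16, which correspond to 1, 2, 3, 5, and 6 above the tonic. Continuing to a sixth note in this upward chain would add 243/128, the 7th degree.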

In contrast, 4 and 7 are not among the first notes generated along this perfect-fifth path. Their overtone ratios match the tonic and the core stable notes less closely, making them more prone to creating tension and instability. The 7th in particular strongly "points" to the tonic, pulling the melody back to its source, while the 4th tends to hover in tension between the 3rd and the 5th. These notes became important driving forces for melodic development in later tonal music, but in early vocal practice, which valued smooth flow and auditory comfort, they were not suitable for prolonged use.

Therefore, the Gong, Shang, Jue, Zhi, and Yu scale is not a "simplified scale," but rather an active selection of the pitch relationships that the ear perceives as most stable. It retains the notes that are most easily integrated in the overtone structure and least likely to create internal conflict, forming a sonic space that can flow freely without being forced to resolve. It is in this space that the melody can unfold naturally without being constrained by excessive tension.

This system does not originate from precise calculations of absolute frequencies, but from countless attempts, corrections, and imitations that gradually brought the sound closer to the "smoothest and most stable" positions. In this sense, the five notes of the pentatonic scale (Gong, Shang, Jue, Zhi, Yu) do not record some abstract rule, but rather a collective choice in favor of overtone stability, made by the human auditory system through long-term practice.

In the development of Western music, we can see a path highly similar to that of the Gong, Shang, Jue, Zhi, and Yu scale, but leading to a different end. Early church modes and folk melodies were also built upon the stable relationship between perfect fifths and overtones: by continuously stacking fifths and folding them back into one octave, a set of pitches formed that the human ear integrates easily and hears as harmonious. This process is essentially no different from the Eastern method of generating a pentatonic scale through fifths.

The real fork occurs when polyphonic music becomes the core mode of expression. With the emergence of harmony, counterpoint, and functional accompaniment, music was no longer just about "how smoothly the melody flows," but began to take on a new task: how to create a sense of direction and belonging when multiple voices sound simultaneously. At this point, the pentatonic system consisting only of 1, 2, 3, 5, and 6 is no longer sufficient, because it lacks a strong enough "pull" and "return mechanism."

It was under this demand that the notes 4 and 7, previously avoided for their auditory instability, were systematically introduced. In their overtone structure, they form a more complex proportional relationship with the tonic, generating significant tension: 4 tends to release towards 3 or 5, while 7 strongly points towards the tonic 1. It is this "instability" that gives polyphonic music a clear directionality, enabling harmonic progressions to form a "departure, deviation, return" structure. This mechanism was ultimately fixed in the major and minor key systems, forming the framework of modern tonal music.

From this perspective, the Western heptatonic scale is not a simple "extension" of the pentatonic scale, but rather a structural choice made for the sake of harmony and tonality. It retains the stable core generated by fifths (1, 2, 3, 5, 6) while introducing 4 and 7 to create tension and direction, forming a sound system that can both remain stable and be continuously propelled forward.

Therefore, neither the Eastern scale of Gong, Shang, Jue, Zhi, and Yu nor the Western major and minor keys are products of theoretical design. Rather, they are two different "balance solutions" naturally selected under the dual constraints of the human auditory system and musical practice: one leaning towards the stability of melodic flow, the other towards the directionality of harmonic structure.

4.2 The auditory representation of major and minor keys

In the previous section, we mentioned that major and minor keys were not a designed system, but rather the result of gradual stabilization through long-term singing practice. So, when these structures are actually sung and heard, what exactly is the ear perceiving? To answer this question, we need to temporarily set aside history and naming itself, and return to the most direct auditory experience.

In modern vocal music, major and minor keys are almost ubiquitous. When a melody is sung and listened to, the pitch arrangement of a major scale is usually integrated by the ear into a bright, stable, and open overall feeling, while the minor scale, because some of its interval relationships differ, tends to present a softer and slightly introspective auditory color. These differences do not stem from emotional labels added afterwards, but from the varying degrees of overtone matching in the pitch arrangement and the auditory system's natural response to stable structures.

Taking the most common C major scale as an example, the scale consists of eight notes: C D E F G A B C. Semitones appear between E and F and between B and C, and all other steps are whole tones, which corresponds exactly to one run of white keys on a piano.

This arrangement of pitches means that the octaves, fifths, fourths, and even thirds within the scale match the overtone series closely, resulting in a stable and natural sound.

In contrast, C minor consists of eight notes: C D E♭ F G A♭ B♭ C (the semitones now fall between D and E♭ and between G and A♭). This arrangement slightly alters the relationships between intervals, resulting in a softer, slightly melancholic or introspective sound, in clear contrast to major keys. This subtle difference is also an important tool for composers to evoke emotion in melody and harmony.
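The two scales differ only in where the semitones fall, which can be sketched as step patterns over the twelve equal-tempered pitches. The example below is a minimal illustration: it assumes the natural minor form of the scale and uses a flat-only spelling table, which happens to give the right names for C but would misspell notes in some other keys.

```python
# Twelve pitch names per octave, spelled with flats only (a simplifying assumption).
NOTE_NAMES = ["C", "D♭", "D", "E♭", "E", "F", "G♭", "G", "A♭", "A", "B♭", "B"]

# Step sizes in semitones: 2 = whole tone, 1 = semitone.
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]
NATURAL_MINOR_STEPS = [2, 1, 2, 2, 1, 2, 2]


def build_scale(tonic_index, steps):
    """Walk the step pattern upwards from the tonic and collect note names."""
    position = tonic_index
    names = [NOTE_NAMES[position % 12]]
    for step in steps:
        position += step
        names.append(NOTE_NAMES[position % 12])
    return names


c_major = build_scale(0, MAJOR_STEPS)
c_minor = build_scale(0, NATURAL_MINOR_STEPS)

print("C major:", " ".join(c_major))
print("C minor:", " ".join(c_minor))
```

Both patterns add up to twelve semitones, so each scale ends on the C an octave above where it began. The only difference is that the 3rd, 6th, and 7th degrees sit a semitone lower in the minor scale.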

The charm of major and minor keys lies not only in the progression of simple melodies but also in the construction of harmonies and chords. By choosing different combinations of intervals, vocal and instrumental performances can create rich emotional layers within the same tonality: simple triads can convey a bright or warm feeling, while extended chords and modal variations can bring about tension, anticipation, or mystery. All of this depends on the matching relationship between the notes and overtones in the scale.

For this reason, major and minor keys have become the main framework for modern Western vocal composition and performance: whether it is opera, art song, or popular music, most melodies and harmonies are based on these two modes. They are both the result of historical exploration and a natural product of the combination of auditory perception and musical practice.

5. Summary and Reflection

Looking back, we started by discussing the physical properties of sound, gradually extending to how hearing perceives pitch, intervals, and the relationships between multiple sounds. The overtone structure within a single sound allows us to "hear" a complex set of vibrations as a single tone; while special intervals such as octaves and fifths, relying on stable and simple frequency ratios, allow the auditory system to integrate different pitches into a natural whole with almost no additional judgment. It is this structural stability that makes certain intervals sound exceptionally harmonious, bringing an almost instinctive sense of comfort.

When we shift our perspective back to historical vocal practice, we find that this mechanism is not an abstract theory, but an empirical result repeatedly verified by humankind. Whether it is the Chinese pentatonic scale (Gong, Shang, Jue, Zhi, Yu) or the gradually developing major and minor key system in the West, humanity has, through centuries of singing and instrumental experimentation, consistently selected the interval combinations that are easiest for the ear to integrate, easiest to remember, and easiest to pass on. The modern vocal system is not a set of rules designed out of thin air, but the result of long-term practice gradually settling along the natural preferences of the auditory system. It is not a matter of "who dictates what sounds good," but of our ears making similar choices time and time again.

It is worth noting that these patterns do not depend on cultural background or theoretical knowledge, but rather stem from the physical structure of sound itself and how the human auditory system works. This is why certain intervals and structures can transcend geographical boundaries and eras, repeatedly appearing in different musical systems and being perceived and understood by different people in similar ways.

Therefore, what this article truly aims to convey is not a new set of vocal conclusions, but a perspective on understanding music: the harmony and beauty in music are not accidental aesthetic preferences, but are deeply rooted in the physical characteristics of sound and our own auditory mechanisms. When you realize this and then go back to vocal exercises, listening to songs, or even composing melodies, the subtle differences that were previously judged only by "feeling" often become clearer and more credible. This is not because feeling has been replaced, but because you have begun to understand why you feel this way.

Having finished this article, I want to reiterate: it's not a vocal tutorial, nor is it a comprehensive guide to music theory. Rather, it's more like my personal path of exploration—an attempt to understand what we're actually listening to when we listen to music. For professionals, some of the content may be too basic; for general readers, some details might be a bit obscure. But regardless, if you're willing to read patiently, you'll see that the development of sound, intervals, and scales is actually closely linked to physical structure, auditory psychology, and historical practice.

The value of this article may not lie in making you sing better immediately or understanding all music theory, but in providing a way of thinking: attempting to connect scattered musical phenomena and knowledge into a more complete and "understandable" whole. Hopefully, it can offer some clues or inspiration for your curiosity about sound.

📚 Series of articles: The Awakening of the Voice - Basics (1/3)


