Common Audio Terms
PCM: ‘Pulse Code Modulation’. This is the fully decoded representation of audio data that can be consumed directly by audio devices. This is the same data format that is typically found in an uncompressed WAV file. The stream consists of a series of frames of data where each sample in each frame represents the signal’s amplitude level for that time slice. Each sample in the PCM stream can be represented as integer or floating point data of various sizes. The most common representation is 16-bit signed integer data. PCM data is commonly operated on as either floating point samples or as signed integer values in larger integer containers (i.e., 16-bit samples in 32-bit containers). The extra space allows for temporary overflow or underflow without having to clamp at each processing step.
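As a rough illustration of why larger containers are useful, the standalone sketch below (not part of any audio interface described here) mixes two 16-bit PCM buffers by accumulating in 32-bit integers and clamping only once when producing the output:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mix two 16-bit PCM streams.  The sums are accumulated in 32-bit containers so
// intermediate values may temporarily exceed the 16-bit range, then the result
// is clamped a single time when written back to the 16-bit output.
std::vector<int16_t> mixPcm16(const std::vector<int16_t>& a, const std::vector<int16_t>& b)
{
    std::vector<int16_t> out(std::min(a.size(), b.size()));

    for (size_t i = 0; i < out.size(); i++)
    {
        int32_t sum = int32_t(a[i]) + int32_t(b[i]);                // may overflow 16 bits.
        out[i] = int16_t(std::clamp<int32_t>(sum, -32768, 32767));  // clamp only once.
    }

    return out;
}
```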
sample: a single piece of audio data. This is a single value that represents the signal’s amplitude for the current time slice. The frame rate determines how many frames are processed per second, and its inverse specifies the length of the time slice for each sample. All samples within a frame occupy the same time slice, but target different channels.
channel: a single part of a frame. A channel contains the stream of data that will eventually end up on a single output speaker or that came from a single input source. A channel can be thought of as the container for a sample of data.
frame: consists of a single sample for each channel. For example, in a stereo stream, each frame will have two samples - one for the left channel and one for the right channel. A frame is the smallest unit of data that can be processed by an audio playback context.
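The sketch below (a hypothetical helper, not part of the interface) shows how frames are laid out in an interleaved PCM buffer; each frame holds one sample per channel, so frame i starts at index i * channelCount:

```cpp
#include <cstddef>
#include <cstdint>

// Fetch one channel's sample from a given frame in an interleaved PCM buffer.
// Frame `frameIndex` starts at index `frameIndex * channelCount`; its samples
// for channels 0..channelCount-1 follow consecutively.
int16_t getSample(const int16_t* buffer, size_t frameIndex, size_t channelCount, size_t channel)
{
    return buffer[(frameIndex * channelCount) + channel];
}
```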
frame rate: the total number of frames per second that should be processed by the audio device. Each sample represents the amplitude of the signal for the period of the frame rate’s inverse. A common frame rate is 48kHz. Every 48 frames takes 1ms to play back, or each frame occupies ~20.833µs.
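For reference, the arithmetic above can be checked with a small standalone snippet (48kHz is just the example rate mentioned above):

```cpp
#include <cstdio>

int main()
{
    const double frameRate = 48000.0;              // frames per second (48kHz).
    const double frameTimeUs = 1.0e6 / frameRate;  // one frame's time slice, in microseconds.
    const double framesPerMs = frameRate / 1000.0; // frames played back per millisecond.

    // prints "frame time: 20.833us, frames per ms: 48.0".
    printf("frame time: %.3fus, frames per ms: %.1f\n", frameTimeUs, framesPerMs);
    return 0;
}
```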
sample rate: see “frame rate”. This name is unfortunately historical; the value should really be called the ‘frame rate’. The name derives from the fact that early audio systems could only play single channel sound, which meant a frame and a sample were the same thing. As stereo and higher channel count audio devices became more common, it became clear that a frame could no longer be just a single sample. The term ‘sample rate’ persisted, however.
block: the smallest number of frames that can be written to a stream. For PCM data, this is always one frame. For other formats, this may be more than one frame. Most compressed formats are block oriented and can only handle encoding or decoding entire blocks at once. For example, MS-ADPCM uses a block size of between 32 and 512 frames. The block size is constant for the entire stream.
block aligned: refers to a buffer that contains exactly one or more blocks of data. A partial block is not allowed in the buffer. Since each block is a fixed size for the stream, the buffer’s size in bytes will be a multiple of the stream’s block size.
frame aligned: refers to a buffer that contains exactly one or more frames of data. A partial frame is not allowed in the buffer. Since each frame is a fixed size for the stream, the buffer’s size in bytes will be a multiple of the stream’s frame size. This mostly only applies to PCM data. Note that for PCM data, a frame aligned buffer will also be block aligned since each frame is also a block.
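The alignment rules above boil down to a simple modulo check. The helpers below are hypothetical and only illustrate the idea:

```cpp
#include <cstddef>

// A buffer is block aligned when its byte size is an exact multiple of the
// stream's block size in bytes.
bool isBlockAligned(size_t bufferBytes, size_t blockSizeBytes)
{
    return blockSizeBytes != 0 && (bufferBytes % blockSizeBytes) == 0;
}

// A buffer is frame aligned when its byte size is an exact multiple of the
// frame size (channel count times bytes per sample).  For PCM data the block
// size equals the frame size, so a frame aligned buffer is also block aligned.
bool isFrameAligned(size_t bufferBytes, size_t channelCount, size_t bytesPerSample)
{
    const size_t frameSizeBytes = channelCount * bytesPerSample;
    return frameSizeBytes != 0 && (bufferBytes % frameSizeBytes) == 0;
}
```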
cycle: a single processing pass of a playback context. During each cycle, the mixer will produce a certain amount of audio data (e.g., 10ms), then wait out the remainder of that period before starting its next cycle. The playback context will continue to process a fixed amount of audio data at a fixed rate. Consistent cycle times help to prevent device starvation and some types of audio artifacts. The rate of running cycles may not be exactly fixed, however; if the device is near starvation or overflow, the rate of cycles may speed up or slow down to accommodate the need. A baking context always tries to run its cycles as quickly as possible.
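The following is a minimal sketch of such a cycle loop (the mixCycle() call is a hypothetical placeholder, not an actual engine function); it only illustrates the "produce a fixed amount, then wait out the period" structure described above:

```cpp
#include <chrono>
#include <thread>

void runCycles(const bool& running)
{
    using namespace std::chrono;
    const auto cyclePeriod = milliseconds(10); // amount of audio produced per cycle.

    while (running)
    {
        const auto start = steady_clock::now();

        // mixCycle();  // hypothetical: produce 10ms of final audio and hand it to the device.

        // wait out the remainder of the period so data is delivered at a
        // (nearly) fixed rate; a real mixer may shorten or lengthen this wait
        // if the device is close to starving or overflowing.
        const auto elapsed = steady_clock::now() - start;
        if (elapsed < cyclePeriod)
            std::this_thread::sleep_for(cyclePeriod - elapsed);
    }
}
```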
device: a hardware (or simulated software) audio device. This is often connected to a set of speakers. The device is responsible for taking a single stream of audio data and rendering it to a single destination (e.g., speakers, a file, a network stream, etc.). Similarly, on the input side, the device is responsible for being the source of captured audio data. It provides a means of connecting the physical world’s audio to the software world. Only a single device may be in use at any time by any given playback or capture context. Switching devices dynamically on playback is sometimes supported.
sound data object (SDO): an object that encapsulates a single sound asset. This is the basic unit of data that can be operated on by the IAudioPlayback interface. Some or all of the asset may be played. Raw PCM data may not be operated on unless it is encapsulated within an SDO. An SDO’s contents may be modified by the caller as it sees fit, but its size will remain fixed.
sound group: a group of one or more sound data objects and their ranges. A sound group may be used to collect similar related sounds together and allow one of them to be chosen for a given playback task. An example of this may be footstep sounds. Having the same footstep sound play over and over again can result in irritating output. However, if several variations on a footstep are put into a sound group and one is chosen for each footstep instance, it can liven up the scene.
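A minimal sketch of how a variation might be picked from a group is shown below; the SoundDataObject type and the selection policy are assumptions for illustration only:

```cpp
#include <cstddef>
#include <random>
#include <vector>

struct SoundDataObject; // stand-in for the engine's sound data object type.

// Pick one variation from a sound group at random so that repeated events
// (footsteps, impacts, etc.) don't always play the identical asset.
const SoundDataObject* chooseVariation(const std::vector<const SoundDataObject*>& group, std::mt19937& rng)
{
    if (group.empty())
        return nullptr;

    std::uniform_int_distribution<size_t> pick(0, group.size() - 1);
    return group[pick(rng)];
}
```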
voice: a playing instance of a single sound data object. This instance may be playing only a portion of the SDO or the whole asset. The voice handle is a value that uniquely identifies the playing asset within the playback context. The voice handle is used to make parameter modifications while the sound is playing or to stop the sound. The voice handle does not need to be destroyed and will be internally recycled when its task is done. Even if a voice handle is not disposed of after its task is done, any attempt to operate on the stale handle will simply be ignored. A playback context may have up to 16,777,216 active voice handles at any given time.
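One common way such a recyclable handle scheme can be built (not necessarily how this engine implements it) is to pack a 24-bit slot index together with a generation counter so a stale handle no longer matches a reused slot:

```cpp
#include <cstdint>

// Pack a 24-bit slot index (16,777,216 possible slots) together with a
// generation counter.  When a slot is recycled, its generation is bumped, so a
// stale handle no longer matches and can be detected and ignored.
constexpr uint32_t kIndexBits = 24;
constexpr uint32_t kIndexMask = (1u << kIndexBits) - 1;

constexpr uint32_t makeHandle(uint32_t index, uint32_t generation)
{
    return (generation << kIndexBits) | (index & kIndexMask);
}

constexpr uint32_t handleIndex(uint32_t handle)
{
    return handle & kIndexMask;
}

constexpr uint32_t handleGeneration(uint32_t handle)
{
    return handle >> kIndexBits;
}
```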
context object: the main interface object for the audio plugins. This is responsible for finding and enumerating suitable audio devices, setting and retrieving global audio processing settings, and performing periodic updates on the audio state. This is the first object created when using an audio capture or playback interface, and it manages and encapsulates the full state of the audio processing engine for that context.
spatial audio: an audio playback mode that simulates audio being produced by an object in a 3D world. This requires 3D position, orientation, and velocity information for each sound emitter object in the world. The 3D positioning information includes the coordinates of the emitter in space, the direction it is facing (usually specified with a ‘forward’ and ‘up’ vector), and its velocity. The velocity information is only used for Doppler factor calculations. The 3D simulation is performed by modifying the volume and pitch values for the emitter for each speaker attached to the final output. This gives the illusion of the sound coming from one direction versus another.
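As a simplified sketch of the two per-emitter values a spatial simulation adjusts, the snippet below computes an inverse-distance volume and a classic Doppler pitch factor; the vector type, attenuation model, and constants are assumptions for illustration and not the engine's actual model:

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

static float length(Vec3 v) { return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }

// Inverse-distance volume attenuation: full volume inside the reference
// distance, falling off as the emitter moves farther from the listener.
float spatialVolume(Vec3 listenerPos, Vec3 emitterPos, float referenceDistance)
{
    Vec3 d = { emitterPos.x - listenerPos.x, emitterPos.y - listenerPos.y, emitterPos.z - listenerPos.z };
    float dist = std::max(length(d), referenceDistance);
    return referenceDistance / dist;
}

// Doppler pitch scale: project both velocities onto the listener->emitter axis
// and apply the classic (c + vListener) / (c + vEmitter) factor.
float dopplerPitch(Vec3 listenerPos, Vec3 emitterPos, Vec3 listenerVel, Vec3 emitterVel, float speedOfSound = 343.0f)
{
    Vec3 d = { emitterPos.x - listenerPos.x, emitterPos.y - listenerPos.y, emitterPos.z - listenerPos.z };
    float dist = std::max(length(d), 1.0e-6f);
    Vec3 n = { d.x / dist, d.y / dist, d.z / dist };

    float listenerSpeed = listenerVel.x * n.x + listenerVel.y * n.y + listenerVel.z * n.z;
    float emitterSpeed = emitterVel.x * n.x + emitterVel.y * n.y + emitterVel.z * n.z;

    return (speedOfSound + listenerSpeed) / (speedOfSound + emitterSpeed);
}
```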
non-spatial audio: an audio playback mode that plays a sound as it was originally mastered. This mode will not automatically modify the volume and pitch levels of the sound. This mode effectively maps the sound to the target speakers.
3D audio: see “spatial audio”. This name is historical. This came about in early audio engines that supported spatial or positional audio effects. This name came from the fact that each sound object was given a position in 3D space.
2D audio: see “non-spatial audio”. This name is historical. This came about in early audio engines that supported spatial or positional audio effects as a way to differentiate them from ‘3D sounds’. It doesn’t actually indicate that the sound emitters are specified in 2D space, but simply that it is not a spatial sound.
entity: a spatial audio object. This can be classified as either an emitter or a listener. An entity is just the common name for all audio objects that are involved in a spatial audio simulation.
emitter: an entity that represents the position, orientation, velocity, and potentially some physical characteristics of an object that produces sound in the 3D simulated world.
listener: an entity that represents the position, orientation, and velocity of the user’s avatar in a simulation. This may be represented by the camera, a character or object on screen, or just any point in the simulated world. The output of the 3D simulated audio will be processed to appear as though the audio coming out of the speakers were heard from the perspective of this entity.
output: this represents the logical destination of the final audio produced by the audio processing engine. This output receives a single stream of audio data from the engine and sends it to one or more destinations. The destinations may include up to one hardware device and zero or more streamer targets. The output will perform any necessary conversions to get the engine’s final stream of data to the format required by each destination.
baking: the act of producing final output audio data at the fastest rate possible. The output of a baking operation does not necessarily target a hardware audio device. This is typically done to pre-process audio data to apply preset effects, filters, volume changes, etc. ahead of time to avoid processing overhead at runtime.