
Sound Quality Jury Testing

Siemens Experimenter

 

The sound a product makes influences customers’ perception of the product. Ensuring that a product will convey just the right brand values is not only an engineering challenge, but involves human psychology as well.

 

People evaluate the sound of a product subjectively, so their opinions can differ. For example, two coworkers on a noise engineering team could listen to their vacuum product and disagree on whether it sounds pleasant or not. This situation would make it difficult to proceed in engineering a better product sound.

 

An objective and empirical approach to solving this issue is needed. For example, a noise engineering team could use several different objective sound quality metrics: loudness, prominence ratio, tone-to-noise ratio, roughness, and fluctuation strength. Each metric is designed to evaluate a particular aspect of sound (whether it sounds sharp or dull, tonal or broadband, etc.). Usually, a single metric alone cannot predict end user satisfaction with the sound of the product; rather, a combination of metrics is needed to fully describe the sound.

 

Figure 1: Many sound metrics to choose from, but no single metric alone tells the whole story.

A jury test, in which a group of people rate sounds, is used to ascertain the exact combination of metrics needed to fully understand the perception of a product’s sound quality.  Given the sound preferences from the jurors, a mathematical analysis is performed to determine the combination of objective sound metrics that would predict the jury ratings.

 


 

One possible result of a jury test is a “golden equation” (Figure 2): a weighted combination of sound metrics that predicts how the jury will respond to a sound.

 

Figure 2: An example of a golden equation.

With the golden equation in hand, changes can be made to the product’s sound, and the jury results can be predicted, without having to assemble a large pool of people.  If executed properly, the golden equation can be thought of as capturing the ‘DNA’ of the desired product sound.
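
To make the idea of a weighted combination concrete, here is a minimal Python sketch of evaluating a golden equation for two versions of a product sound. The metric names, weights, and intercept are hypothetical values chosen for illustration; they are not taken from Figure 2 or from any real jury test.

```python
def predict_preference(metrics, weights, intercept=0.0):
    """Evaluate a golden equation: intercept + sum(weight * metric value)."""
    missing = set(weights) - set(metrics)
    if missing:
        raise KeyError(f"missing metric values: {sorted(missing)}")
    return intercept + sum(w * metrics[name] for name, w in weights.items())


# Hypothetical weights and intercept, for illustration only.
WEIGHTS = {"loudness_sone": -0.25, "fluctuation_strength_vacil": -2.0}
INTERCEPT = 10.0

# Hypothetical metric values for the current design and a modified design.
baseline = {"loudness_sone": 20.0, "fluctuation_strength_vacil": 0.5}
modified = {"loudness_sone": 16.0, "fluctuation_strength_vacil": 0.25}

baseline_score = predict_preference(baseline, WEIGHTS, INTERCEPT)
modified_score = predict_preference(modified, WEIGHTS, INTERCEPT)
print(baseline_score, modified_score)
```

With these assumed weights, the quieter, less fluctuating design receives the higher predicted preference, without playing either sound to a jury.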

 

Jury Testing

 

How is a jury test executed?  What are the main steps?

 

Jury testing consists of the following key steps (see Figure 3):

  1. Measure product sound
  2. Jury: jury selection, attribute rating, and training
  3. Play sound samples to the jury and get subjective ratings
  4. Objective analysis
  5. Correlate the subjective opinion of the jurors with the objective results of the metric calculations. This results in the golden equation.

Figure 3: Key steps in jury analysis.

1. Measure Product Sound

 

When measuring the product sound in preparation for a jury test, the recordings should capture the sounds of interest as authentically as possible.  Considerations include the type of recording, the recording environment, and product conditions.

 

Recording Equipment:

 

The sounds that the jury will judge will either be recordings that the test engineer measures or modified versions of recordings.

 

To measure sounds for a jury test, ideally two types of measurement equipment are used (Figure 4):

  • A binaural head or headset
  • A single, free-field microphone

Figure 4: Binaural headset (left) and single microphone (right).

The two measurement devices should be used simultaneously during recording:

  • Binaural Recording - The jury will listen to the data recorded with the binaural device. A binaural device records sound much as a human would hear it: it records in stereo and takes into account the filtering effects of the human ear and body.
  • Single Microphone - Data taken with a single microphone will be used to calculate the sound metrics. Most sound metrics were developed using data from a single microphone.

The recording environment should also be carefully considered.

 

Recording Environment:

 

The recording environment should be completely silent other than the product of interest. The environment should also accurately reflect typical placement conditions for the product.

 

For example, if recording sounds from a coffee brewer, it would be wise to put the brewer in an anechoic chamber to ensure the recording does not have any other noise contamination. Coffee brewers are often placed on a hard reflective surface (like a granite counter) and also backed against another hard reflective surface (like a tile wall). Therefore, it may be wise to introduce these reflective surfaces in the anechoic chamber during recording to more accurately replicate operating conditions (Figure 5).

 

Figure 5: It is important to match the measurement environment with a typical use environment. For example, it may be wise to introduce reflective surfaces when measuring a coffee brewer as brewers typically sit on hard countertops and are pushed against hard reflective walls (such as a tile wall).

The recording device(s) should be put where the listener would usually be. So, in the brewer example, the recording device(s) should be about head level and a typical distance from the brewer. 

 

Figure 6: It is important to record both data from a single microphone and a binaural device. The mic and binaural device should be located where the user’s ears would be located. The single microphone data will be used to calculate sound metrics and the binaural data will be used for jury replay.

Recording Conditions and Benchmarking:

 

The same conditions should be used for all tests. For example, if recording coffee brewers, the same beans should be used, the same initial water temperature, the same cup, etc.

 

2. Jury Selection, Attribute Rating, and Training

 

Selecting the correct personnel for the jury, and ensuring they are properly prepared, is important for the successful execution of a jury test.

 

Jury Gathering

 

Gathering an appropriate jury is just as important as recording the appropriate sounds. Different demographics of people may have varied subjective opinions of a sound sample. For example, in Figure 7, there are two different listeners of the same sound.

 

Figure 7: Two different reactions to the same sound.

 

These two listeners have different reactions to the same sound.

 

When selecting a jury, it is important to keep the end user in mind. If you were on a motorcycle exhaust engineering team, you would probably want to gather jurors who own motorcycles or who are interested in purchasing a motorcycle. You would not want to select a juror who is woken up every morning by his neighbor’s revving motorcycle engine.

 

Jury Objectives

 

The way in which the sound is rated is also critical. Continuing the motorcycle example, if the jurors are asked to “Rate this motorcycle for sportiness” versus “Rate this motorcycle for luxuriousness”, different results will be obtained. It is important to define the adjective with the jurors so they know exactly what you mean by “sporty”. Does it mean that the engine has a lot of horsepower? Does it mean that it can accelerate quickly? Etc.

 

Jury Training

 

Once the jury is gathered, the jurors need to be trained. They should be familiar with the software as well as the types of sounds they will listen to. It is a good idea to have them listen to a few sound samples before having them take the official test. That way they know how long the samples will be, what type of samples they will listen to, etc.

 

The jury should also be comfortable with how the software works so they are not tripped up by the buttons in the official test.

 

In LMS Test.Lab Jury Testing, it is possible to have jurors take a practice test before taking the actual test. It is also possible to select specific recordings to include in the training session vs in the main test (Figure 8).

 

Figure 8: Sounds can be selected for the main test, training session, or both.

Check out the video below for an example of a training session. The training session is to familiarize the jurors with the software, recordings, and test format.

 

(view in My Videos)

 

Demographics

 

The composition of the jury should be noted.  Any relevant factor, such as experience with a type of product, age, income, gender, etc. should be gathered. 

 

In LMS Test.Lab Jury Testing, it is possible to gather this demographic data and link it to the jurors.

 

 

 

 

An example distribution for product experience is shown in Figure 9.

 

Figure 9: Example of a jury product experience distribution. The LMS Test.Lab Jury Testing software color-codes jurors’ demographic answers with information about how reliable each juror is (more information on consistency and concordance later).

Knowing some background information on the jurors allows for a more complete understanding of their responses. For example, if two sounds are compared, perhaps all jurors younger than 35 years will prefer the first sound and all jurors over 35 years will prefer the second sound.  Essentially, by collecting demographic information from the jury it is possible to determine a link between preference for a particular sound and a demographic.
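
As an illustration of linking preference to a demographic, the following Python sketch splits paired-comparison votes by age group. The juror names, ages, votes, and the 35-year cutoff are made-up example data, and the helper name `preference_by_group` is hypothetical; it is not part of LMS Test.Lab.

```python
from collections import defaultdict


def preference_by_group(votes, group_of):
    """Return, per demographic group, the share of jurors preferring each sound."""
    counts = defaultdict(lambda: defaultdict(int))
    for juror, sound in votes.items():
        counts[group_of[juror]][sound] += 1
    shares = {}
    for group, tally in counts.items():
        total = sum(tally.values())
        shares[group] = {sound: n / total for sound, n in tally.items()}
    return shares


# Made-up example data: which of two sounds (A or B) each juror preferred.
votes = {"j1": "A", "j2": "A", "j3": "B", "j4": "B", "j5": "B", "j6": "A"}
ages = {"j1": 24, "j2": 31, "j3": 52, "j4": 47, "j5": 60, "j6": 29}
group_of = {j: ("under 35" if age < 35 else "35 and over") for j, age in ages.items()}

shares = preference_by_group(votes, group_of)
print(shares)
```

In this invented data set every juror under 35 prefers sound A and every older juror prefers sound B, which is the kind of pattern that only becomes visible once demographic data is linked to the votes.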

 

3. Play Sound Samples to the Jury and Get Subjective Ratings

 

The sounds selected for the jury test should be well planned, from both the selection of the sounds to be played, to how the sounds are presented to the jury. 

 

Sound Sample Variation

 

If certain metrics are thought to be important, the selected sounds should have a wide range of values for that particular metric. If all the values for the metric are close together, it will be impossible to determine if that metric drives jury perception of the sound.

 

In the top graph of Figure 10 (graph “a”, below), the metric values are too similar to determine if there is a correlation between the metric value and the jury preference. In the bottom two graphs, the metric values are more spaced out. This makes it possible to determine if there is a correlation (graph “b”, bottom left) or no correlation (graph “c”, bottom right).

 

Figure 10: In a well-constructed jury test, the sound metrics have sufficient range. Therefore, it can be determined if there is a correlation between jury preference and the value of a metric. In the bottom two graphs, the metric values are the same. The graph on the left shows a clear correlation between the preference and the metric. The graph on the right shows no correlation between the preference and the metric.

 

If there is no correlation between the metric and the jury result, that means that the metric likely does not drive the jury’s perception of the sound and does not need to be included in the golden equation.
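
Before running the test, it can help to check that the candidate sounds actually span each metric of interest. The Python sketch below flags metrics whose range across the sounds is too narrow. The metric values and the minimum ranges are hypothetical placeholders; suitable minimum ranges depend on the product and on how large a difference listeners can perceive.

```python
def metric_spread(values):
    """Return (min, max, range) of a metric over the candidate sounds."""
    lo, hi = min(values), max(values)
    return lo, hi, hi - lo


def narrow_metrics(table, min_range):
    """List metrics whose range across the sounds is below a chosen minimum."""
    flagged = []
    for name, values in table.items():
        if metric_spread(values)[2] < min_range[name]:
            flagged.append(name)
    return flagged


# Hypothetical metric values for four candidate sounds.
candidate_sounds = {
    "loudness_sone": [12.0, 18.5, 25.0, 31.0],
    "sharpness_acum": [1.50, 1.52, 1.49, 1.51],
}
# Hypothetical minimum ranges; choose these for the product at hand.
min_range = {"loudness_sone": 5.0, "sharpness_acum": 0.3}

print(narrow_metrics(candidate_sounds, min_range))
```

Here the sharpness values are nearly identical, so this set of sounds could not show whether sharpness drives preference (the situation in graph “a” of Figure 10).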

 

Sound File Preparation

 

The sounds selected for the jury test can be actual recordings or artificially manipulated sounds. Either or both types of sounds may be included, depending on the objective of the jury test.

 

Recorded Sound Examples:

  • Recordings of your product
  • Recordings of competitor products
  • High-end vs low-end of products on the market

Manipulated Sound Examples:

  • Filter / boost certain frequencies (simulate product modifications)
  • Tune range of certain metrics (wide vs narrow range of sharpness)
  • Artificially make all sounds identical in terms of loudness. This helps the listener focus on attributes aside from loudness. For example, a listener may be asked to rate a car sound based on “luxuriousness”. If one of the car sample sounds is dramatically quieter than the others, the juror will likely choose the quiet sample as the “most luxurious” regardless of other sound characteristics. Therefore, artificial volume levelling may be desired to drive focus on metrics other than loudness.
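
As a rough illustration of volume levelling, the Python sketch below scales one signal so that its RMS level matches another. This is an assumption made for simplicity: RMS level is only a crude stand-in for perceived loudness, and a real jury test would more likely equalize a psychoacoustic loudness metric on a calibrated replay system. The function names and the sine-wave test signals are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_to_target(samples, target_rms):
    """Scale samples by a constant gain so their RMS equals target_rms."""
    current = rms(samples)
    if current == 0.0:
        raise ValueError("cannot level a silent signal")
    gain = target_rms / current
    return [s * gain for s in samples]


# Two synthetic test tones that differ only in amplitude.
quiet = [0.1 * math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
loud = [0.5 * math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]

levelled = level_to_target(quiet, rms(loud))
print(round(rms(quiet), 4), round(rms(levelled), 4), round(rms(loud), 4))
```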

Test Construction

 

To keep the jurors engaged, the test should not be so long that it fatigues the listeners.  Some guidelines for the test:

  • Keep stationary sounds around 5 seconds
  • Keep non-stationary sounds as short as possible
  • Total test time should not exceed 40 minutes

Long recordings with varied sound content can confuse listeners, as their auditory memory may not be able to retain/comprehend the entire recording.  For example, if evaluating the brewing of coffee, instead of recording the entire brew time (several minutes), individual events like filling and discharge (several seconds) can be broken apart and compared.

 

Figure 11: Jury test in progress.

In addition, the following should be considered:

  • Use high quality headphones in conjunction with a calibrated replay system to ensure the playback is as similar as possible to the actual product sound
  • Quiet and comfortable room with no distractions – there should not be posters or art pieces distracting the jurors
  • Adequate space for each juror to not be distracted or annoyed by other jurors

Once the jury test sounds are prepared, a rating scheme needs to be selected.

 

Jury Test Format

 

The ratings of the sounds by the jury can be performed in many different ways. The three most popular are:

  • Paired Comparison
  • Category Judgment
  • Semantic Differential

Paired Comparison

 

Paired comparison is perhaps the simplest test type for a novice juror. In a paired comparison test, jurors are presented with two sounds. The juror listens to both sounds and indicates which sound he prefers. Alternatively, a specific question can be posed, for example, “Which sound is more powerful?” The juror then listens to the two sounds and selects the more powerful sounding one.

 

Figure 12: Paired comparison test.

The disadvantage of a paired comparison test is the execution time. The execution time grows rapidly with each additional sound being evaluated: n sounds produce n(n-1)/2 pairs, so the number of questions grows roughly with the square of the number of sounds. Each question requires two sounds to be played. It is also recommended to do a consistency check. To do a consistency check, the same sound pair is presented more than once. A consistent juror should always pick the same sound as the preferred sound of the pair.
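
The following Python sketch builds a paired comparison schedule and estimates its duration. It assumes every unordered pair is presented once, a few pairs are repeated for the consistency check, each sound lasts 5 seconds, and a juror needs about 5 seconds to answer. The answer time and the function names are assumptions for illustration only.

```python
import itertools
import random


def paired_comparison_schedule(sounds, n_repeats=0, seed=0):
    """All unordered pairs plus n_repeats repeated pairs, in shuffled order."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(sounds, 2))
    repeats = rng.sample(pairs, n_repeats)  # pairs shown a second time
    schedule = pairs + repeats
    rng.shuffle(schedule)
    return schedule


def estimated_minutes(schedule, sound_seconds=5.0, answer_seconds=5.0):
    """Rough duration: two sounds per question plus time to answer."""
    return len(schedule) * (2 * sound_seconds + answer_seconds) / 60.0


sounds = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
schedule = paired_comparison_schedule(sounds, n_repeats=4)
minutes = estimated_minutes(schedule)
print(len(schedule), minutes)

# With 20 sounds there are already 190 pairs before any repeats.
large = paired_comparison_schedule([f"S{i}" for i in range(20)])
print(len(large), estimated_minutes(large))
```

Under these assumptions, 8 sounds with 4 repeated pairs take about 8 minutes, while 20 sounds take about 47.5 minutes, which exceeds a 40-minute limit on total test time.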

 


 

Category Judgment

 

For the category judgment test, each sound is played once. The juror then rates the sound on a sliding scale for particular attributes. For example, after listening to the sound a juror may rate how “powerful” it is on a scale of 1-10.

 

Figure 13: Category judgment test.

Naïve listeners may struggle to rate sounds. For example, if a juror is listening to engine noise, he may rate the very first sound as a 10 for “powerful”. However, if the next sound is even more powerful, then he has already maxed out the scale and is unable to accurately rate the remaining sounds. Therefore, category judgment requires trained jurors with strong product knowledge.

 


 

Semantic Differential

 

The semantic differential test is similar to the category judgment test. However, instead of rating the sound using one adjective, a bipolar pair is presented. For example “weak vs powerful”.

 

Figure 14: Semantic differential test.

 

This pair, which has opposing attributes, can help a naïve jury. The test duration is similar to the category judgment test.

 


 

4. Objective Analysis

 

After the jury test has been performed, and all the votes are in, it is time to correlate the subjective results to the objective sound metrics. 

 

To ensure high quality correlation, jury test results should be double-checked first, before attempting correlation!  Two checks are done: consistency and concordance.

 

  • Consistency – Consistency is checked in two ways: AB consistency and circular triad consistency.
    • AB Consistency: When presented with the same sounds, did the juror always make the same choice? If not, the juror may be confused or not paying attention. For example, if performing an AB Comparison test, the same sound pair may be presented to the juror twice. The juror should be consistent in his choice of the “preferred” sound from the pair. See Figure 15, left.
    • Circular Triad: Circular triad consistency ensures jurors are consistent in their hierarchy of ratings. For example, if a juror rates Sound 1 higher than Sound 2, and rates Sound 2 higher than Sound 3, then the juror should rate Sound 1 higher than Sound 3. See Figure 15, right.
  • Concordance – Concordance checks whether an individual juror “follows the pack” with his responses. Were the juror’s selections very different from those of the rest of the jurors? If a juror does not follow the pack with his responses, he is given a low concordance score.

Figure 15: Consistency is checked in two different ways. For circular triad, the greater-than sign (>) indicates which sound the juror preferred. So, 1>2 means the juror preferred Sound 1 over Sound 2.

Consistency and concordance range in value from 0 to 1. The closer to 1, the more consistent or concordant the juror. After running a jury test, the concordance and consistency of each juror can be plotted on a graph like in Figure 16.
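
The article does not state the formulas behind these scores, so the Python sketch below uses stand-ins that also range from 0 to 1. AB consistency is the fraction of repeated pairs answered the same way. Triad consistency uses Kendall's coefficient of consistency, based on the count of circular triads. Concordance is the fraction of pairs where a juror agrees with the majority of the other jurors, which is a simple assumption. All names and the example answers are hypothetical.

```python
import itertools


def ab_consistency(repeated_answers):
    """Fraction of repeated pairs answered identically (first, second choice)."""
    same = sum(1 for first, second in repeated_answers if first == second)
    return same / len(repeated_answers)


def make_prefers(answers):
    """Build prefers(x, y) -> True if x was chosen over y in `answers`."""
    def prefers(x, y):
        key = (x, y) if (x, y) in answers else (y, x)
        return answers[key] == x
    return prefers


def circular_triads(sounds, prefers):
    """Count triads whose preferences form a cycle (A>B, B>C, C>A)."""
    return sum(
        1
        for a, b, c in itertools.combinations(sounds, 3)
        if prefers(a, b) == prefers(b, c) == prefers(c, a)
    )


def triad_consistency(sounds, prefers):
    """Kendall's coefficient of consistency: 1 - d / d_max."""
    n = len(sounds)
    if n < 3:
        raise ValueError("need at least three sounds")
    d_max = (n**3 - n) / 24 if n % 2 else (n**3 - 4 * n) / 24
    return 1.0 - circular_triads(sounds, prefers) / d_max


def concordance(juror, choices, pairs):
    """Fraction of pairs where `juror` agrees with the majority of the others."""
    agree = counted = 0
    for pair in pairs:
        others = [c[pair] for name, c in choices.items() if name != juror]
        first = others.count(pair[0])
        second = len(others) - first
        if first == second:
            continue  # no majority among the other jurors
        majority = pair[0] if first > second else pair[1]
        counted += 1
        agree += choices[juror][pair] == majority
    return agree / counted if counted else 1.0


sounds = ["S1", "S2", "S3"]
pairs = list(itertools.combinations(sounds, 2))
transitive = {("S1", "S2"): "S1", ("S1", "S3"): "S1", ("S2", "S3"): "S2"}
circular = {("S1", "S2"): "S1", ("S1", "S3"): "S3", ("S2", "S3"): "S2"}
choices = {"ann": transitive, "bob": transitive, "cat": circular, "dan": transitive}

for name, answers in choices.items():
    score = triad_consistency(sounds, make_prefers(answers))
    print(name, score, round(concordance(name, choices, pairs), 3))
```

In this made-up example, the juror with the circular answers (1>2, 2>3, but 3>1) gets a triad consistency of 0 and also the lowest concordance.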

 

Figure 16: Consistency versus concordance for a jury test. Each diamond marker represents a juror.

Explore the different regions of the graph:

  • Green: Jurors who are both very consistent and concordant are in the upper right.
  • Red: Jurors who are neither consistent nor concordant are placed in the lower left. Jurors who fall in this area should probably be excluded from the analysis. If all of the juror results are here, the test may have been poorly designed (perhaps there was not enough variance between the sounds for the jurors to distinguish).
  • Purple: Jurors who have high consistency but low concordance should not immediately be removed from jury analysis. These jurors were consistent with their preferences but did not follow the rest of the pack’s opinion. In this case, an investigation should be done to see if these jurors are of a different demographic from the rest of the jury. For example, maybe all of the highly concordant jurors are ages 50+, while all of the non-concordant jurors were ages 20 and under. Perhaps the age difference in the group is driving the difference in preference.
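
The regions above can be expressed as a simple triage rule. In the Python sketch below, the 0.7 threshold is a hypothetical value, since the article does not give numeric boundaries for the regions. The case of low consistency with high concordance is not discussed in the article, so the sketch only marks it for manual review.

```python
def classify_juror(consistency, concordance, threshold=0.7):
    """Map a juror's scores to a suggested action (threshold is hypothetical)."""
    consistent = consistency >= threshold
    concordant = concordance >= threshold
    if consistent and concordant:
        return "keep"  # green region
    if consistent:
        return "investigate demographics"  # purple region
    if not concordant:
        return "consider excluding"  # red region
    return "review manually"  # region not covered in the article


jurors = {"j1": (0.95, 0.90), "j2": (0.90, 0.30), "j3": (0.20, 0.25)}
for name, (cons, conc) in jurors.items():
    print(name, classify_juror(cons, conc))
```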

5. Correlation

 

After performing the jury test, the sound preferences are known subjectively. In the next step, objective sound metric values will be calculated and correlated to the subjective results.

 

For each sample, the subjective results of the jury test are tabulated as well as the objective metric values.

 

Figure 17: For each sample, the subjective results of the jury test are tabulated as well as the objective metric values.

 

The subjective jury preference can be plotted against the objective metrics to see if a strong correlation exists as shown in Figure 18.

 

Figure 18: The preference (result from the jury test) is plotted against the sound metric (result from objective analysis). The R^2 value is calculated to determine if there is a strong dependency.

Metrics that are correlated with preference (like loudness and fluctuation strength) can be included in the golden equation. Metrics that are not correlated with preference (like sharpness and tonality) should not be included in the golden equation.  Use the R^2 value to determine if there is a relationship between a metric and the jury preference.
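
For a straight-line fit of preference against a single metric, R^2 equals the squared Pearson correlation coefficient. The Python sketch below computes it for two metrics. The metric values and preference scores are invented for illustration, and this R^2 only detects a linear dependency.

```python
def r_squared(x, y):
    """R^2 of a straight-line fit of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary across the samples")
    return sxy * sxy / (sxx * syy)


# Invented example: jury preference and two metrics for five sounds.
preference = [9.0, 7.0, 5.0, 3.0, 1.0]
loudness = [10.0, 15.0, 20.0, 25.0, 30.0]
sharpness = [1.0, 2.0, 1.0, 2.0, 1.5]

r2_loudness = r_squared(loudness, preference)
r2_sharpness = r_squared(sharpness, preference)
print(round(r2_loudness, 3), round(r2_sharpness, 3))
```

In this invented data, loudness explains the preference completely (R^2 of 1), while sharpness explains almost none of it (R^2 of 0.1).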

 

Using a regression analysis, it is possible to determine the relationship between all of the metrics and the jury preference.

Figure 19: Using regression analysis, it is possible to determine the relationship between the metrics and the jury’s preference.

This is the golden equation.
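
The article does not specify which regression method is used. As one common choice, the Python sketch below fits a multiple linear regression by ordinary least squares, solving the normal equations with Gaussian elimination. The metric values are hypothetical, and the preference scores are synthetic: they were generated without noise from preference = 10 - 0.25 * loudness - 2 * fluctuation strength, so the fit should recover those coefficients. Real jury data is noisy and will not fit exactly.

```python
def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: metrics may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_golden_equation(metric_rows, preferences):
    """Least-squares fit of preference = b0 + b1*m1 + b2*m2 + ..."""
    design = [[1.0] + list(row) for row in metric_rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * p for r, p in zip(design, preferences)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical (loudness, fluctuation strength) values for five sounds.
metric_rows = [(12.0, 0.2), (18.0, 0.8), (24.0, 0.4), (30.0, 1.0), (20.0, 0.6)]
# Synthetic, noise-free preference scores (see the text above).
preferences = [6.6, 3.9, 3.2, 0.5, 3.8]

coefficients = fit_golden_equation(metric_rows, preferences)
print([round(c, 4) for c in coefficients])
```

The returned list holds the intercept followed by one weight per metric, in the same order as the columns of `metric_rows`.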

 

Golden Equation

 

The golden equation uses sound metrics to determine how a jury will react to a sound: will the jury like the sound or not? An engineer could then make slight modifications to a product’s sound, record that sound, calculate sound metrics, feed the values into the golden equation, and determine how a listener would react to the new sound (would the listener like the sound more or less?).

 

Future test iterations will not require assembling a jury to predict results.

 

Questions? Email jacklyn.kinsler@siemens.com

 


Comments
Enthusiast

Hello,

I have a question about the conclusion “Metrics that are not correlated with preference (like sharpness and tonality) should not be included in the golden equation”.

The question is:

Is it possible that there is a nonlinear relationship between metrics (like sharpness and tonality) and preference, rather than no relationship?

 

Best regards

Junjun
