UTILE Pronunciation training – Practical guidelines

Author: Tanja Kocjančič-Antolík, Ph.D.

The use of ultrasound tongue imaging (UTI) in second language (L2) speech sound remediation and learning has attracted considerable interest in recent years (Bliss, Abel & Gick, 2018; Kocjančič Antolík, 2020).

UTI is a safe, non-invasive, and user-friendly method of visualizing the tongue in real time during speech (Stone, 2005). It captures a midsagittal or coronal view of the tongue and shows the tongue shape, position, and movement.

How does ultrasound work?

An ultrasound probe, placed under the speaker’s chin, emits high-frequency sound waves that travel straight upwards from the probe through the chin and tongue tissue. At a boundary between two media of different density, the waves are reflected and travel back to the probe. Based on the time elapsed between the emission and the reception of the wave, and on information about the density of human tissue, the point of reflection is calculated and marked on the image as a bright point. In tongue imaging, the reflecting boundary lies between the tongue surface and the air above it, or between the tongue surface and bone when the tongue touches the hard palate. Importantly, when the tongue tip is raised, an air pocket forms below it; the emitted ultrasound waves are reflected at the boundary between the chin tissue and this air pocket and cannot reach the tongue tip. For this reason, UTI cannot be used to reliably determine the exact position of the tongue tip in the resulting images.
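
The time-to-distance step described above can be illustrated with a minimal sketch. It assumes that the relevant tissue property enters the calculation as an average speed of sound in soft tissue (about 1540 m/s, a common textbook value that is not taken from this document) and that the wave travels to the boundary and back, so the one-way depth is half of the total path. The function name and the example timing are hypothetical.

```python
# Assumed average speed of sound in soft tissue (m/s). This is a common
# textbook value, not a figure reported in this document.
SPEED_OF_SOUND_TISSUE_M_S = 1540.0


def echo_depth_mm(round_trip_time_s: float,
                  speed_m_s: float = SPEED_OF_SOUND_TISSUE_M_S) -> float:
    """Return the depth (in mm) of a reflecting boundary.

    round_trip_time_s is the time between emitting the pulse and
    receiving its echo.
    """
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels to the boundary and back, so halve the total path.
    depth_m = speed_m_s * round_trip_time_s / 2.0
    return depth_m * 1000.0


if __name__ == "__main__":
    # Hypothetical echo received 80 microseconds after emission.
    print(round(echo_depth_mm(80e-6), 1))  # prints 61.6
```

Under these assumptions, an echo arriving after 80 microseconds corresponds to a boundary roughly 6 cm above the probe, which is the order of magnitude one would expect for a tongue surface.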

What does the tongue look like on ultrasound images?

UTI allows observing the tongue surface in two views: midsagittal and coronal. The view is changed simply by turning the probe by 90°. In both views, the resulting image shows a bright curve, and the lower edge of this curve represents the tongue surface.

The midsagittal view (on the left in Figure 1) shows the tongue as viewed from the side of the head. The white curve in the image corresponds to the tongue surface from the front (on the right side of the image) to the back (on the left side of the image) of the oral cavity.

The coronal view (on the right in Figure 1) shows the tongue as viewed from the front of the head. The white curve corresponds to the tongue surface from one side of the tongue to the other.

Figure 1. Ultrasound images of the tongue. Left image: midsagittal view with the front of the tongue on the right side of the image. Right image: coronal view.

What else can be seen on the ultrasound images?

As seen in Figure 1, UTI images only the tongue surface and no other structures in the oral cavity. The midsagittal image is typically limited by the shadows created by the jaw and the hyoid bones, while the coronal image is limited by the jaw bone only. The images also contain various gray-scaled areas representing less prominent points of ultrasound wave reflection due to tissue structure (such as muscles and fat cells).

However, UTI makes it possible to view the hard palate. To image the hard palate, the speaker is asked to take a sip of water and hold it in the mouth before swallowing it. Because water has a density similar to that of human tissue, the ultrasound waves emitted from the probe are not reflected at the boundary between the tongue surface and the water but only at the boundary between the water and the hard palate. The obtained hard palate outline can then be superimposed on a real-time image to give partial information about the roof of the oral cavity.

Figure 2. Midsagittal ultrasound image of the tongue, palate, and water bolus. Jaw and hyoid shadows limit the view on the right and left side of the image, respectively. The front part of the tongue is on the right side of the image.

Tongue shape vs. position vs. movement

Tongue shape

UTI allows observing even small differences in the shape of the tongue. Furthermore, because the tongue does not behave as a single articulator, the method allows observing these differences in the front, middle, and back parts of the tongue (Figure 3), as well as on either side of the tongue and along the midline (Figure 4).

Figure 3. Midsagittal images of the tongue with the front part of the tongue on the right side of the image. Left image: the back of the tongue lowered, the middle part of the tongue lowered, front of the tongue raised. Right image: the back of the tongue raised, middle and front part lowered.

Figure 4. Coronal images of the tongue. Left image: both sides of the tongue lowered. Right image: both sides of the tongue raised with a visible midline groove.

Tongue position

The position of the tongue in the oral cavity is less straightforward to define. Because no other structures of the oral cavity are directly visible in the image, the tongue position cannot be described in relation to them. However, it can be described either relative to the superimposed hard palate outline (Figure 5) or by comparing two tongue positions (Figure 6). In the left image of Figure 5, the front of the tongue is close to the palate/alveolar ridge, while in the right image, the front of the tongue is lower but the middle and back parts of the tongue are closer to the palate. Figure 6 demonstrates the relationship between two tongue images by plotting the extracted tongue contours together. Such a presentation makes it easier to compare the shape and position of two tongue contours relative to each other.

Importantly, to apply either of these techniques, the probe has to be kept in the same position when obtaining the hard palate image and the tongue images used for comparison. Because the probe emits the ultrasound waves straight upwards, it always scans only the section of the tongue that is directly above the probe. This means that if the probe location or the angle at which it is placed against the chin changes, the section of the tongue directly above it changes as well. UTI then no longer scans the same part of the tongue, and the resulting images cannot be directly compared to each other. The best way to keep scanning the same part of the tongue is to use a special headset that fixes the probe under the chin and prevents its movement. Alternatively, and only for short periods of time, the speaker can be asked to hold the probe still while speaking. This method was used in our experiments during the UTI practice and we did not note any downsides to it.
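
To give a rough sense of why a small change in probe angle matters, the sketch below estimates how far the scanned section shifts at a given height above the probe. It rests on a simplified geometric assumption that is not taken from this document: the probe is treated as pivoting at its contact point with the chin, so the scan plane is displaced by height × tan(tilt). The function name and the example values are hypothetical.

```python
import math


def scan_plane_offset_mm(height_above_probe_mm: float, tilt_deg: float) -> float:
    """Estimate the sideways shift (mm) of the scanned section.

    Simplified model (an assumption): the probe pivots at its contact
    point, so at a given height the scan plane moves by h * tan(tilt).
    """
    return height_above_probe_mm * math.tan(math.radians(tilt_deg))


if __name__ == "__main__":
    # Hypothetical case: tongue surface 60 mm above the probe, 5 degree tilt.
    print(round(scan_plane_offset_mm(60.0, 5.0), 2))
```

Under this model, a tilt of only a few degrees already moves the scanned section by several millimetres at the height of the tongue surface, which is consistent with the recommendation to fix the probe with a headset.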

Figure 5. Midsagittal tongue images with superimposed hard palate trace (in orange). The front of the tongue is on the right side of the image.

Figure 6. Left: midsagittal tongue image of /u/. Middle: midsagittal tongue image of /a/. Right: the traced tongue contours extracted for /u/ (in green) and /a/ (in blue) allow direct comparison of the shape and position of the two contours relative to each other.
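
Once two contours have been extracted with the probe held in the same position, their difference can also be summarized numerically. The sketch below is one generic way of doing so, not the procedure used in the UTILE project: it assumes each contour is available as a list of (x, y) points in millimetres and computes a symmetric mean nearest-neighbour distance between the two point sets. All names and the example coordinates are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def nearest_distance(point: Point, contour: List[Point]) -> float:
    """Distance from one point to the closest point of a contour."""
    return min(math.dist(point, other) for other in contour)


def mean_nearest_neighbour_distance(contour_a: List[Point],
                                    contour_b: List[Point]) -> float:
    """Symmetric mean nearest-neighbour distance between two contours."""
    if not contour_a or not contour_b:
        raise ValueError("both contours must contain at least one point")
    a_to_b = sum(nearest_distance(p, contour_b) for p in contour_a) / len(contour_a)
    b_to_a = sum(nearest_distance(p, contour_a) for p in contour_b) / len(contour_b)
    # Average both directions so the result does not depend on argument order.
    return (a_to_b + b_to_a) / 2.0


if __name__ == "__main__":
    # Hypothetical, heavily simplified contours (x, y in mm).
    contour_u = [(0.0, 10.0), (10.0, 20.0), (20.0, 15.0)]
    contour_a = [(0.0, 5.0), (10.0, 10.0), (20.0, 12.0)]
    print(mean_nearest_neighbour_distance(contour_u, contour_a))  # prints 6.0
```

A value of zero means the contours coincide; larger values indicate that the two tongue shapes or positions differ more. Such a number is only meaningful if the probe did not move between the two recordings.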

Tongue movement

Real-time UTI is an ideal method for observing tongue movements. Video 1 shows a learner saying “can”, on the left, and “tan”, on the right.

Video 1. Midsagittal ultrasound video showing tongue movement during the production of “can” (on the left) and “tan” (on the right).

Benefits of UTI for L2 speech sounds

Because learners can observe their tongue, and the teacher’s tongue, in real-time when producing L2 speech sounds, the method allows them to, first, notice the difference between their own articulation and a standard one (as modeled by the teacher). Second, it makes it easier to understand what kind of movements have to be acquired to achieve the correct tongue shape and position. Third, the learners can improve control of tongue movements by utilizing visual feedback.

UTI also increases the learner’s general awareness of tongue movements, which can potentially have a positive effect on future non-UTI pronunciation training. Importantly, UTI allows learners to produce an L2 speech sound correctly early in the training, even within the first five minutes of practice. Successful production makes learners realize that they can produce even difficult L2 sounds and has a great motivational effect on training.

The equipment


In the UTILE project, we used a Micro system with Articulate Assistant Advanced software (Articulate Instruments Ltd, 2012) by Articulate Instruments [ http://www.articulateinstruments.com/ultrasound-imaging/?target=Echo%20B ]. Figure 7 shows a screen view during pronunciation training.

Figure 7. Screen view during the pronunciation training using Articulate Assistant Advanced.

We are aware of two commercial ultrasound systems aimed at clinical practice: Sonospeech by Articulate Instruments and Speech Language Pathology set by SeeMore. However, almost any kind of medical ultrasound system with an appropriate probe can be used for tongue imaging (for more information on suitable probes, see Lee et al., 2015).

Additional equipment

During the training with UTI, it is useful to be able to mark the target tongue shape or position, the target location in the oral cavity, and/or the palate. If the ultrasound software does not allow such annotations of the real-time images, it is useful to overlay the screen with a transparency and draw the annotation directly on it.

To carry out UTI, it is also necessary to use ultrasound gel, disinfecting wipes for the probe, and wipes for removing the gel from the speaker’s chin.

Delivery of the UTI pronunciation training

In the UTILE project we tested two methods of delivering pronunciation training: individual and classroom.

Individual pronunciation training

Individual pronunciation training has been the preferred method in L2 applications of UTI (Gick et al., 2008; Sisinni et al., 2016; Kocjančič Antolík, Pillot-Loiseau & Kamiyama, 2019). In our experiment, the learners received three 45-minute sessions delivered approximately one week apart (Kocjančič Antolík & Volín, 2019).

Pros: Individual practice allows a sole focus on the learner, offers a chance to produce many repetitions of the target movement(s), makes it possible to adapt the training based on the ongoing success, and provides the opportunity to explain, discuss, and analyze the performed and target movements.

Cons: Time-consuming for both the learner and the trainer; the high level of attention needed to control the tongue can result in learner fatigue.

Classroom pronunciation training

Classroom UTI pronunciation training has been less researched (Kühnert & Kocjančič Antolík, 2017). In our experiment, the learners participated in one or two sessions delivered during their regular 90-minute class on L2 pronunciation (Kocjančič Antolík, Bořil & Hofmann, in review). During each session, the learners received about 7 minutes of individual training and each learner selected one to four speech sounds they wanted to practice. Additionally, they were asked to actively participate in the individual sessions of their classmates: they all observed the tongue movements of each classmate, compared the movements to the output, and tried to (silently) make the target movements themselves.

Pros: The most important advantage of this method is that several learners can use UTI during one class and benefit not only from their own training but from the training of classmates as well. The training is still delivered individually to each learner. Training target(s) can be selected for each learner separately, or all learners can practice the same target(s).

Cons: Less time to focus on the individual learner, as well as to explain and discuss the productions and targets, and fewer repetitions during the practice.

Teaching and practicing speech sounds with UTI

The main difference between traditional methods of teaching L2 speech sounds and UTI is that the initial focus of the training is not solely on the auditory perception of the produced sound but on tongue movements needed to produce the sound. The goal of the pronunciation training is that the learner acquires the necessary tongue movements for the correct production of the L2 sounds, with the ultimate goal of automatizing the production and using the newly acquired sounds in spontaneous speech. Because acquiring tongue movements is the central part of the pronunciation training, the training needs to be based on motor learning. Cleland et al. (2018) have summarized the principles of motor learning when using UTI as visual feedback in speech therapy. The same principles should be followed when using UTI as visual feedback in L2 speech sound learning or remediation.

Principles of motor learning

The motor learning process is divided into two parts: pre-practice, focused on the acquisition of the new movement, and practice, focused on the automatization of the movement.

In the pre-practice part, the learner needs a high dose of training at least once per week. The training has to be organized in blocks consisting of the same target in its simplest form (e.g. speech sound in isolation). The attentional focus of the learner has to be external (how the produced item sounds) and internal (what kind of tongue movements were made). The feedback has to be given frequently, immediately after the production attempt, and with a short delay, and should give information about the realization of the tongue movement and the correctness of the resulting sound.

A high dose of training at least once per week is required also in the practice part of the motor learning process. However, the training can be organized in random blocks of different and more complex items with external attentional focus. The feedback about the correctness of the produced sound is given less frequently and only after a short delay. The judgment on the correctness should be passed to the learner.

In the UTILE experiments, we focused on the practical application of using UTI in L2 speech sound learning and remediation. Because of the practical component, linked mainly to time constraints, the pronunciation training was focused mainly on the acquisition of new tongue movements.

UTILE protocol for speech sound training with UTI

  1. Initial familiarization with the UTI
    After a brief explanation of how ultrasound works and what the tongue images look like, the learner observes his or her own tongue during speech. The learner is instructed to pay attention to how movements of different parts of the tongue look on the real-time image. For example, the learner can be asked to produce all the native vowels, front (e.g. /t, n, s/) and back (e.g. /k/) consonants, and to comment on the difference in tongue shape and position.
  2. Learner’s production of the target speech sound(s)
    The learner produces the target L2 sound in isolation and describes the tongue shape and movement.
  3. Teacher’s production
    The teacher produces the target L2 sound in isolation and the learner describes the tongue shape and movement. The learner has to comment on the differences between his or her own and the teacher’s production.
  4. Learner’s practice
    The practice starts with the target L2 sound in isolation and progresses to a simple one-syllable sequence of a consonant and a vowel (CV or VC), followed by CVC or VCV sequences, monosyllables with consonantal clusters (CCV, VCC, VCCV), and multisyllabic sequences. Whenever possible, real words are used for practice. At the last practice stage, real words containing the target L2 sounds are produced in a sentence. The complexity of the practice material increases once the learner correctly produces 10 repetitions at the current complexity level.
    During the session, the learner observes his or her tongue in real time on the computer screen. The teacher guides the learner by noting which part of the tongue needs to move in a certain direction. Once the learner successfully produces the target speech sound in isolation while observing the tongue on the screen, he or she is asked to say the same sound with eyes closed, focusing on how it feels to make the new movement. After a few correctly produced repetitions with closed eyes and focus on the tongue movements, the learner is asked to focus also on how the produced speech sound sounds.
    In the first training session, the feedback about the tongue movements and the resulting sound is initially given after every attempt. From the second session on, the feedback is delayed, giving the learner a chance to evaluate their own productions first.
    At the end of each session, the learner describes the necessary tongue movements in their own words to help memorize the motor components. The learner is asked to practice the production of the target L2 sound at home and to pay specific attention to the tongue movements.
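
The progression rule in step 4 of the protocol can be sketched as a small tracker that a teacher or a training tool might use to keep count. This is only an illustration under stated assumptions: the class and its names are hypothetical, and the 10 correct repetitions are counted cumulatively per level (the protocol does not say whether they need to be consecutive).

```python
from dataclasses import dataclass

# Complexity levels in the order given in the protocol.
LEVELS = [
    "sound in isolation",
    "CV or VC sequence",
    "CVC or VCV sequence",
    "monosyllables with consonant clusters (CCV, VCC, VCCV)",
    "multisyllabic sequences",
    "real words in a sentence",
]

# Correct repetitions required before moving on, as stated in the protocol.
REQUIRED_CORRECT = 10


@dataclass
class PracticeTracker:
    """Hypothetical helper that tracks a learner's current practice level."""

    level_index: int = 0
    correct_count: int = 0

    @property
    def level(self) -> str:
        return LEVELS[self.level_index]

    def record_attempt(self, correct: bool) -> str:
        """Record one production attempt and return the level to practice next."""
        if correct:
            # Assumption: correct repetitions are counted cumulatively.
            self.correct_count += 1
        at_last_level = self.level_index == len(LEVELS) - 1
        if self.correct_count >= REQUIRED_CORRECT and not at_last_level:
            self.level_index += 1
            self.correct_count = 0
        return self.level


if __name__ == "__main__":
    tracker = PracticeTracker()
    for _ in range(10):
        tracker.record_attempt(correct=True)
    print(tracker.level)  # prints "CV or VC sequence"
```

If consecutive correct repetitions were required instead, the counter would simply be reset to zero after every incorrect attempt.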

Example of pronunciation training with UTI

Video 2 shows an example of pronunciation training with UTI. The goal for the learner was to improve the production of vowel /u:/. The video first shows the learner’s production of the vowel /u:/ before and after receiving instructions on tongue position. Successful trials are followed by practicing the vowel preceded or followed by consonants /k/ and /t/. Finally, the learner produces the vowel /u:/ in isolation with minimal effort. Importantly, the whole training session lasted about seven minutes, during which the learner practiced three different vowels.

Video 2. Example of UTI training. The front of the tongue is on the left side of the image.

Learners’ experience

Overall, the learners reported the following:

  • Ultrasound images of the tongue are easy to understand.
  • Observing their tongue in real-time helped them to understand the target movements and to better control their own tongue when practicing.
  • All of the learners were able to produce the target sound correctly, most of them already in the first training session. This made them realize that they can produce difficult L2 sounds correctly.
  • The learners practicing in the classroom setting agreed that observing the practice of classmates helped them to better understand the target movements.
  • All of the learners agreed that using UTI is beneficial and that they would like to continue using it.
  • The L2 teacher who collaborated in the classroom training commented that the method significantly simplifies describing the necessary tongue movements to the learners and helps to better understand the learners’ individual articulatory difficulties.


Articulate Instruments Ltd. (2012). Articulate Assistant Advanced User Guide: Version 2.14. Edinburgh, UK: Articulate Instruments Ltd.
Bliss, H., Abel, J. & Gick, B. (2018). Computer-assisted visual articulation feedback in L2 pronunciation instruction: a review. Journal of Second Language Pronunciation 4, 129-153.
Cleland, J., Wrench, A., Lloyd, S. & Sugden, E. (2018). ULTRAX2020: Ultrasound technology for optimising the treatment of speech disorders: Clinicians’ resource manual.
Gick, B., Bernhardt, B., Bacsfalvi, P. & Wilson, I. (2008). Ultrasound imaging applications in second language acquisition. In Hansen Edwards, J. G. & Zampini, M. L. (eds.), Phonology and second language acquisition (pp. 309–322). Amsterdam: John Benjamins.
Kocjančič Antolík, T. (2020). Ultrasound Tongue Imaging in Second Language Learning. Studie z aplikované lingvistiky, 2020, 1, 109-116.
Kocjančič Antolík, T. & Volín, J. (2019). Ultrasound tongue imaging for vowel remediation in Czech English. In Sasha Calhoun, Paola Escudero, Marija Tabain & Paul Warren (eds.) Proceedings of the 19th International Congress of Phonetic Sciences, Melbourne, Australia 2019 (pp. 3651-3655). Canberra, Australia: ASSTA Inc.
Kocjančič Antolík, T., Bořil, T. & Hofmann, S. (in review). Acoustic and articulatory visual feedback in classroom L2 vowel remediation. Language Learning & Technology.
Kocjančič Antolík, T., Pillot-Loiseau, C. & Kamiyama, T. (2019). The effectiveness of real-time ultrasound visual feedback on tongue movements in L2 pronunciation training: Japanese learners improving the French vowel contrast /y/-/u/. Journal of Second Language Pronunciation, 5, 72-97.
Kühnert, B. & Kocjančič Antolík, T. (2017). Exploring the use of ultrasound visual feedback in the classroom: a pilot study on the acquisition of selected English vowel contrasts by French learners. 5th International Conference on English Pronunciation: Issues & Practices. Caen, France.
Sisinni, B., d’Apolito, S., Fivela, B. G. & Grimaldi, M. (2016). Ultrasound articulatory training for teaching pronunciation of L2 vowels. ICT for language learning, 265–270.
Stone, M. (2005). A guide to analysing tongue motion from ultrasound images. Clinical linguistics & phonetics, 19(6-7), 455-501.
