A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context

Song-Yun Kim; Jin-Hyok Song; Dae-Hun Pak; Dong-Song Pak; Myong-Hyok Won; Hakho Hong

doi:doi:10.11648/j.ajetm.20251006.11

Research Article |

| Peer-Reviewed

A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context

Song-Yun Kim, Jin-Hyok Song, Dae-Hun Pak, Dong-Song Pak, Myong-Hyok Won, Hakho Hong^*

Published in American Journal of Engineering and Technology Management (Volume 10, Issue 6)

Received: 19 November 2025 Accepted: 8 December 2025 Published: 29 December 2025

Views: Downloads:

Download PDF

Share This Article

Twitter
Linked In
Facebook

Abstract

While end-to-end text-to-speech (E2E TTS) has significantly improved speech quality compared to traditional TTS, the computational cost is very expensive due to the use of complex neural network architectures. The synthetic time has been much reduced by the efforts to reduce the computational cost. In TTS, the real-time factor in the device should be less than 1, and the latency is required to be possibly small. One way to reduce latency in E2E TTS system is incremental TTS. In incremental TTS, there is a disadvantage of the loss of naturalness at the boundary between the sentence segments, as speech is synthesized in units of the sentence segment. To improve naturalness at the boundary between the sentence segments, we take into account the context. Then, taking into account the context as text or the context as an intermediate feature of encoder and decoder containing attention, the amount of computation in acoustic model can increase and the synthetic speech can be broken at the boundary between the sentence segments. That is, incremental TTS is subject to a trade-off between the amount of computation and naturalness of synthetic speech. In this paper we propose an incremental Korean TTS method taking into account the intermediate feature-level context, which is based on the analysis of two-stage E2E TTS consisting of an acoustic model and a vocoder. We present experimental result conducted on the FastSpeech2 model, which shows the effectiveness of our approach.

Published in	American Journal of Engineering and Technology Management (Volume 10, Issue 6)
DOI	10.11648/j.ajetm.20251006.11
Page(s)	94-100
Creative Commons	This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.
Copyright	Copyright © The Author(s), 2025. Published by Science Publishing Group

Keywords

TTS, Incremental TTS, Encoder, Decoder, Transformer

References

[1]	Pouget, M., Hueber, T., Bailly, G., Baumann, T., “Hmm training strategy for incremental speech synthesis,” in 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), pp. 1201–1205, 2015. https://doi.org/10.21437/Interspeech.2015-304
[2]	Pouget, M., Nahorna, O., Hueber, T., Bailly, G., “Adaptive latency for part-of-speech tagging in incremental text-to-speech synthesis,” in 17th Annual Conference of the International Speech Communication Association, pp. 2846–2850, 2016. https://doi.org/10.21437/Interspeech.2016-165
[3]	Baumann, T., Schlangen, D., “Evaluating prosodic processing for incremental speech synthesis,” in Thirteenth Annual Conference of the International Speech Communication Association, 2012. https://doi.org/10.21437/Interspeech.2012-152
[4]	Baumann, T., “Decision tree usage for incremental parametric speech synthesis,” in Proc. of ICASSP, pp. 3819–3823, 2014. https://doi.org/10.1109/ICASSP.2014.6854316
[5]	Yanagita, T., Sakti, S., Nakamura, S., “Incremental TTS for Japanese language,” in Proc. Interspeech, pp. 902–906, 2018. https://doi.org/10.21437/Interspeech.2018-1561
[6]	Yanagita, T., Sakti, S., Nakamura, S., “Neural iTTS: Toward Synthesizing Speech in Real-time with End-to-end Neural Text-to-Speech Framework”, In Proc. 10th ISCA Speech Synthesis Workshop, 2019. https://doi.org/10.21437/SSW.2019-33
[7]	Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., & Xie, L. (2020). Multi-band melgan: Faster waveform generation for high-quality text-to-speech. In Proceedings of Spoken Language Technology Workshop (SLT). https://doi.org/10.1109/SLT48900.2021.9383551
[8]	Ren Y., Hu C., Tan X., Qin T., Zhao S., Zhao Z., & Liu T.-Y. (2021). Fastspeech 2: Fast and high-quality end-to-end text to speech. In Proceedings of international conference on learning representations (ICLR 2021). https://doi.org/10.48550/arXiv.2006.04558
[9]	Saeki, T., Takamichi, S., Saruwatari, H. “Incremental Text-to-Speech Synthesis Using Pseudo Lookahead with Large Pretrained Language Model.” arXiv preprint arXiv: 2012.12612v2 [cs. SD], 2021. https://doi.org/10.1109/LSP.2021.3073869
[10]	Muyang D., Chuan L., Junjie L. “Incremental FastPitch: Chunk-Based High Quality Text To Speech.” arXiv preprint arXiv: 2401.01755v1 [cs. SD], 2024. https://doi.org/10.48550/arXiv.2401.01755
[11]	Yanagita, T., Sakti, S., Nakamura, S., "Japanese Neural Incremental Text-to-speech Synthesis Framework with an Accent Phrase Input", Volume 11, 22355-22363, IEEE Access, Mar. 2, 2023.
[12]	Kayyar, K., Dittmar, C., Pia, N., Habets, E., “Low-resource text-to-speech using specific data and noise augmentation,” in Proc. IEEE-SPS European Signal Processing Conf., pp. 61-65., 2023.
[13]	Bataev, V., Ghosh, S., Lavrukhin, V., Li, J., "TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer," arXiv preprint arXiv: 2501.06320v1, 2025.
[14]	Chen, Y., Niu, Z., Ma, Z., Deng, K., Wang, C., Zhao, J., Yu, K., Chen, X. "F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching." arXiv preprint arXiv: 2410.06885, 2024.
[15]	Shen, K., Ju, Z., Tan, X., Liu, E., Leng, Y., He, L., Qin, T., Zhao, S., Bian, J., "NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers." In Proc. Intl. Conf. Learning Representations (ICLR), 2024.
[16]	Peng, P., Huang, P., Li, D, Mohamed, A., Harwath, D., " Voice-Craft: Zero-Shot Speech Editing and Text-to-Speech in the Wild." arXiv preprint arXiv: 2403.16973, 2024.

Cite This Article

Plain Text BibTeX RIS

APA Style

Kim, S., Song, J., Pak, D., Pak, D., Won, M., et al. (2025). A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context. American Journal of Engineering and Technology Management, 10(6), 94-100. https://doi.org/10.11648/j.ajetm.20251006.11

Copy | Download

ACS Style

Kim, S.; Song, J.; Pak, D.; Pak, D.; Won, M., et al. A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context. Am. J. Eng. Technol. Manag. 2025, 10(6), 94-100. doi: 10.11648/j.ajetm.20251006.11

Copy | Download

AMA Style

Kim S, Song J, Pak D, Pak D, Won M, et al. A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context. Am J Eng Technol Manag. 2025;10(6):94-100. doi: 10.11648/j.ajetm.20251006.11

Copy | Download

@article{10.11648/j.ajetm.20251006.11,
  author = {Song-Yun Kim and Jin-Hyok Song and Dae-Hun Pak and Dong-Song Pak and Myong-Hyok Won and Hakho Hong},
  title = {A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context},
  journal = {American Journal of Engineering and Technology Management},
  volume = {10},
  number = {6},
  pages = {94-100},
  doi = {10.11648/j.ajetm.20251006.11},
  url = {https://doi.org/10.11648/j.ajetm.20251006.11},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajetm.20251006.11},
  abstract = {While end-to-end text-to-speech (E2E TTS) has significantly improved speech quality compared to traditional TTS, the computational cost is very expensive due to the use of complex neural network architectures. The synthetic time has been much reduced by the efforts to reduce the computational cost. In TTS, the real-time factor in the device should be less than 1, and the latency is required to be possibly small. One way to reduce latency in E2E TTS system is incremental TTS. In incremental TTS, there is a disadvantage of the loss of naturalness at the boundary between the sentence segments, as speech is synthesized in units of the sentence segment. To improve naturalness at the boundary between the sentence segments, we take into account the context. Then, taking into account the context as text or the context as an intermediate feature of encoder and decoder containing attention, the amount of computation in acoustic model can increase and the synthetic speech can be broken at the boundary between the sentence segments. That is, incremental TTS is subject to a trade-off between the amount of computation and naturalness of synthetic speech. In this paper we propose an incremental Korean TTS method taking into account the intermediate feature-level context, which is based on the analysis of two-stage E2E TTS consisting of an acoustic model and a vocoder. We present experimental result conducted on the FastSpeech2 model, which shows the effectiveness of our approach.},
 year = {2025}
}

Copy | Download

TY  - JOUR
T1  - A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context
AU  - Song-Yun Kim
AU  - Jin-Hyok Song
AU  - Dae-Hun Pak
AU  - Dong-Song Pak
AU  - Myong-Hyok Won
AU  - Hakho Hong
Y1  - 2025/12/29
PY  - 2025
N1  - https://doi.org/10.11648/j.ajetm.20251006.11
DO  - 10.11648/j.ajetm.20251006.11
T2  - American Journal of Engineering and Technology Management
JF  - American Journal of Engineering and Technology Management
JO  - American Journal of Engineering and Technology Management
SP  - 94
EP  - 100
PB  - Science Publishing Group
SN  - 2575-1441
UR  - https://doi.org/10.11648/j.ajetm.20251006.11
AB  - While end-to-end text-to-speech (E2E TTS) has significantly improved speech quality compared to traditional TTS, the computational cost is very expensive due to the use of complex neural network architectures. The synthetic time has been much reduced by the efforts to reduce the computational cost. In TTS, the real-time factor in the device should be less than 1, and the latency is required to be possibly small. One way to reduce latency in E2E TTS system is incremental TTS. In incremental TTS, there is a disadvantage of the loss of naturalness at the boundary between the sentence segments, as speech is synthesized in units of the sentence segment. To improve naturalness at the boundary between the sentence segments, we take into account the context. Then, taking into account the context as text or the context as an intermediate feature of encoder and decoder containing attention, the amount of computation in acoustic model can increase and the synthetic speech can be broken at the boundary between the sentence segments. That is, incremental TTS is subject to a trade-off between the amount of computation and naturalness of synthetic speech. In this paper we propose an incremental Korean TTS method taking into account the intermediate feature-level context, which is based on the analysis of two-stage E2E TTS consisting of an acoustic model and a vocoder. We present experimental result conducted on the FastSpeech2 model, which shows the effectiveness of our approach.
VL  - 10
IS  - 6
ER  -

Copy | Download

Author Information

Song-Yun Kim

Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea
Jin-Hyok Song

Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea
Dae-Hun Pak

Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea
Dong-Song Pak

Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea
Myong-Hyok Won

Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea
Hakho Hong

Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea

Contact Email

http://orcid.org/0000-0003-1018-7262

Download PDF

Submit an Article

Sections

Plain Text BibTeX RIS

APA Style

Kim, S., Song, J., Pak, D., Pak, D., Won, M., et al. (2025). A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context. American Journal of Engineering and Technology Management, 10(6), 94-100. https://doi.org/10.11648/j.ajetm.20251006.11

Copy | Download

ACS Style

Kim, S.; Song, J.; Pak, D.; Pak, D.; Won, M., et al. A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context. Am. J. Eng. Technol. Manag. 2025, 10(6), 94-100. doi: 10.11648/j.ajetm.20251006.11

Copy | Download

AMA Style

Kim S, Song J, Pak D, Pak D, Won M, et al. A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context. Am J Eng Technol Manag. 2025;10(6):94-100. doi: 10.11648/j.ajetm.20251006.11

Copy | Download

@article{10.11648/j.ajetm.20251006.11,
  author = {Song-Yun Kim and Jin-Hyok Song and Dae-Hun Pak and Dong-Song Pak and Myong-Hyok Won and Hakho Hong},
  title = {A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context},
  journal = {American Journal of Engineering and Technology Management},
  volume = {10},
  number = {6},
  pages = {94-100},
  doi = {10.11648/j.ajetm.20251006.11},
  url = {https://doi.org/10.11648/j.ajetm.20251006.11},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajetm.20251006.11},
  abstract = {While end-to-end text-to-speech (E2E TTS) has significantly improved speech quality compared to traditional TTS, the computational cost is very expensive due to the use of complex neural network architectures. The synthetic time has been much reduced by the efforts to reduce the computational cost. In TTS, the real-time factor in the device should be less than 1, and the latency is required to be possibly small. One way to reduce latency in E2E TTS system is incremental TTS. In incremental TTS, there is a disadvantage of the loss of naturalness at the boundary between the sentence segments, as speech is synthesized in units of the sentence segment. To improve naturalness at the boundary between the sentence segments, we take into account the context. Then, taking into account the context as text or the context as an intermediate feature of encoder and decoder containing attention, the amount of computation in acoustic model can increase and the synthetic speech can be broken at the boundary between the sentence segments. That is, incremental TTS is subject to a trade-off between the amount of computation and naturalness of synthetic speech. In this paper we propose an incremental Korean TTS method taking into account the intermediate feature-level context, which is based on the analysis of two-stage E2E TTS consisting of an acoustic model and a vocoder. We present experimental result conducted on the FastSpeech2 model, which shows the effectiveness of our approach.},
 year = {2025}
}

Copy | Download

TY  - JOUR
T1  - A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context
AU  - Song-Yun Kim
AU  - Jin-Hyok Song
AU  - Dae-Hun Pak
AU  - Dong-Song Pak
AU  - Myong-Hyok Won
AU  - Hakho Hong
Y1  - 2025/12/29
PY  - 2025
N1  - https://doi.org/10.11648/j.ajetm.20251006.11
DO  - 10.11648/j.ajetm.20251006.11
T2  - American Journal of Engineering and Technology Management
JF  - American Journal of Engineering and Technology Management
JO  - American Journal of Engineering and Technology Management
SP  - 94
EP  - 100
PB  - Science Publishing Group
SN  - 2575-1441
UR  - https://doi.org/10.11648/j.ajetm.20251006.11
AB  - While end-to-end text-to-speech (E2E TTS) has significantly improved speech quality compared to traditional TTS, the computational cost is very expensive due to the use of complex neural network architectures. The synthetic time has been much reduced by the efforts to reduce the computational cost. In TTS, the real-time factor in the device should be less than 1, and the latency is required to be possibly small. One way to reduce latency in E2E TTS system is incremental TTS. In incremental TTS, there is a disadvantage of the loss of naturalness at the boundary between the sentence segments, as speech is synthesized in units of the sentence segment. To improve naturalness at the boundary between the sentence segments, we take into account the context. Then, taking into account the context as text or the context as an intermediate feature of encoder and decoder containing attention, the amount of computation in acoustic model can increase and the synthetic speech can be broken at the boundary between the sentence segments. That is, incremental TTS is subject to a trade-off between the amount of computation and naturalness of synthetic speech. In this paper we propose an incremental Korean TTS method taking into account the intermediate feature-level context, which is based on the analysis of two-stage E2E TTS consisting of an acoustic model and a vocoder. We present experimental result conducted on the FastSpeech2 model, which shows the effectiveness of our approach.
VL  - 10
IS  - 6
ER  -

Copy | Download