Today I read a paper titled “Speech Synthesis with Neural Networks”.
The abstract is:
Text-to-speech conversion has traditionally been performed either by concatenating short samples of speech or by using rule-based systems to convert a phonetic representation of speech into an acoustic representation, which is then converted into speech.
This paper describes a system that uses a time-delay neural network (TDNN) to perform this phonetic-to-acoustic mapping, with another neural network to control the timing of the generated speech.
The neural network system requires less memory than a concatenation system, and performed well in tests comparing it to commercial systems using other technologies.
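To make the TDNN idea in the abstract a bit more concrete, here is a minimal sketch of what a phonetic-to-acoustic mapping with a time-delay layer might look like. The layer sizes, context width, and feature dimensions below are my own assumptions for illustration, not the paper's actual architecture; the weights are random rather than trained, and the second network that controls timing is omitted entirely.

```python
import numpy as np

# Hypothetical dimensions -- assumed for illustration, not taken from the paper.
N_PHONETIC = 40    # phonetic feature values per input frame (assumed)
N_HIDDEN   = 50    # hidden units in the time-delay layer (assumed)
N_ACOUSTIC = 20    # acoustic parameters per output frame (assumed)
CONTEXT    = 5     # frames of context on each side of the current frame (assumed)

rng = np.random.default_rng(0)

# Randomly initialised weights stand in for a trained network.
W1 = rng.normal(0, 0.1, size=(N_HIDDEN, N_PHONETIC * (2 * CONTEXT + 1)))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0, 0.1, size=(N_ACOUSTIC, N_HIDDEN))
b2 = np.zeros(N_ACOUSTIC)

def tdnn_forward(phonetic_frames: np.ndarray) -> np.ndarray:
    """Map a (T, N_PHONETIC) sequence of phonetic frames to a (T, N_ACOUSTIC)
    sequence of acoustic frames. Each output frame is computed from a sliding
    window of input frames -- the "time delay" part of a TDNN."""
    T = phonetic_frames.shape[0]
    # Pad the sequence so every frame has a full context window.
    padded = np.pad(phonetic_frames, ((CONTEXT, CONTEXT), (0, 0)))
    outputs = np.empty((T, N_ACOUSTIC))
    for t in range(T):
        window = padded[t : t + 2 * CONTEXT + 1].ravel()   # stacked context frames
        hidden = np.tanh(W1 @ window + b1)                 # time-delay (conv-like) layer
        outputs[t] = W2 @ hidden + b2                      # acoustic parameters for frame t
    return outputs

# Usage: 100 frames of (made-up) phonetic features in, 100 acoustic frames out.
acoustic = tdnn_forward(rng.normal(size=(100, N_PHONETIC)))
print(acoustic.shape)  # (100, 20)
```

The detail that makes this a TDNN rather than a plain feed-forward net is that each output frame is computed from a window of neighbouring input frames, not just the current one, presumably so the network can capture how adjacent sounds influence each other.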