Digital signal processing techniques are available that pre-process a noisy speech signal to improve the quality of the speech it contains. The pre-processed speech signal would sound less noisy to a listener, by virtue of having better sound quality or being more intelligible. Alternatively, an automatic speech recognition (ASR) process operating upon the pre-processed signal would have a lower word error rate. Speech signal enhancement techniques include those that use a deep neural network (DNN) to produce the enhanced speech signal. Configuring a DNN processor (enabling it to learn) is a substantial task that is both computationally heavy and time intensive. For example, the input “features” of a conventional DNN processor can span a dozen audio frames (e.g., up to 50 milliseconds each) that are behind, and a dozen audio frames that are ahead of, a middle audio frame that is part of the input (noisy) speech signal. The input feature for each frame can be very long (of large dimension). Often this input feature has little or no structure, the prevailing thought being that a sufficiently large or complex DNN will extract from the feature the patterns and information it needs. The input features, over multiple audio frames, are processed by a previously trained DNN processor to produce a single output or target frame. The target frame is a prediction or estimate of the large-dimension feature of the clean speech that is in the input noisy speech signal, based on the large-dimension input features contained in a number of earlier-in-time input frames and a number of later-in-time input frames. It is, however, difficult to visualize what the DNN processor is doing with such a large amount of information, and so a more intuitive solution that is also more computationally efficient is desirable.
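To make the size of such a conventional input concrete, the following is a minimal sketch of how a context window of frames might be stacked into a single DNN input vector. It is illustrative only: the function name `stack_context`, the list-of-floats frame representation, the toy 4-value feature, and the edge-repetition padding are all assumptions, not details taken from the description above.

```python
from typing import List

Frame = List[float]  # one per-frame feature vector (hypothetical representation)


def stack_context(frames: List[Frame], center: int, context: int = 12) -> Frame:
    """Concatenate `context` frames before and after `center` with the center frame.

    Edge frames are repeated when the window runs past either end of the
    signal (an assumed padding choice; the text does not specify one).
    """
    if not frames:
        raise ValueError("frames must not be empty")
    last = len(frames) - 1
    stacked: Frame = []
    for i in range(center - context, center + context + 1):
        clamped = min(max(i, 0), last)  # repeat the first/last frame at the edges
        stacked.extend(frames[clamped])
    return stacked


# Toy example: 30 frames, each carrying a 4-dimensional feature.
frames = [[float(t)] * 4 for t in range(30)]
x = stack_context(frames, center=15, context=12)

# (2 * 12 + 1) frames * 4 values per frame feed the DNN for ONE output frame.
input_dim = len(x)
```

With a dozen frames on each side, the input is 25 times the per-frame feature length. If each frame carried, say, a 257-bin spectrum (an assumed figure), the network would receive 6,425 inputs in order to predict a single 257-value target frame, which illustrates why such a processor is costly to train and hard to interpret.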