In some speech processing systems, speech enhancement (SE) and automatic speech recognition (ASR) are realized by separate engines. An SE module sends an enhanced single-channel audio signal, together with some metadata, to an ASR module. The original multi-channel recordings (e.g., originating from a microphone array) contain information that may be useful for speech detection, such as spatial information that enables a target speaker to be distinguished from interfering speakers, and/or knowledge about a reference signal, which can be useful in echo cancellation. In known systems, this data is available only to the speech enhancement module, where it is condensed into a stream of metadata that is sent in parallel with the enhanced single-channel signal.
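
As a rough illustration of this split, the following Python sketch models the handoff between the two engines. The names `EnhancedFrame` and `enhance_frame` and the metadata keys are hypothetical, and plain channel averaging is only a placeholder for actual enhancement. The point is that the ASR side receives the single-channel signal and the condensed metadata, not the multi-channel recording itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EnhancedFrame:
    """What the SE engine hands to the ASR engine for one frame (hypothetical layout)."""
    samples: List[float]        # enhanced single-channel audio
    metadata: Dict[str, float]  # condensed side information derived from all channels


def _energy(samples: List[float]) -> float:
    """Mean squared amplitude of a block of samples."""
    return sum(s * s for s in samples) / len(samples)


def enhance_frame(channels: List[List[float]]) -> EnhancedFrame:
    """Placeholder SE step: collapse the channels and summarize them as metadata.

    Channel averaging stands in for real enhancement (beamforming, echo
    cancellation, and so on); it is used only to make the data flow concrete.
    """
    num_channels = len(channels)
    num_samples = len(channels[0])
    mono = [
        sum(ch[i] for ch in channels) / num_channels
        for i in range(num_samples)
    ]
    channel_energies = [_energy(ch) for ch in channels]
    # Assumed metadata fields: the multi-channel detail is reduced to a few
    # scalars, which is all the ASR engine will see of the original recording.
    metadata = {
        "mean_energy": _energy(mono),
        "channel_energy_spread": max(channel_energies) - min(channel_energies),
    }
    return EnhancedFrame(samples=mono, metadata=metadata)
```

Calling `enhance_frame` on a two-channel frame returns one `EnhancedFrame`; an ASR engine built on this interface would have no access to the individual channels, only to `samples` and `metadata`.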