The ability to measure or quantify similarity between the spoken content of two segments of audio can provide meaningful insight into the relationship between the two segments. However, without a time-aligned text transcript of the audio, this information is largely inaccessible. Speech-to-text algorithms require dictionaries, are often inaccurate, and are fairly slow; human transcription, while accurate, is time-consuming and expensive. Low-level, feature-extraction-based approaches to audio similarity, in contrast, generally search only for audio duplications.
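To make the contrast concrete, the following is a minimal sketch of the kind of low-level, feature-extraction-based comparison described above, assuming only numpy. The average-magnitude-spectrum feature and cosine similarity used here are illustrative choices, not a method from this work: such a comparison scores acoustically identical or near-duplicate audio highly, but says nothing about whether two different recordings share spoken content.

```python
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude STFT frames of a mono signal (Hann window)."""
    win = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(x[s:s + n_fft] * win))
              for s in range(0, len(x) - n_fft + 1, hop)]
    return np.array(frames)

def spectral_similarity(a, b):
    """Cosine similarity between the average magnitude spectra of two signals."""
    fa = stft_mag(a).mean(axis=0)
    fb = stft_mag(b).mean(axis=0)
    return float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-12))

# Synthetic example: identical tones score near 1; a tone at a
# different frequency scores much lower, because the comparison is
# purely acoustic and knows nothing about spoken content.
t = np.arange(16000) / 16000.0
tone_a = np.sin(2 * np.pi * 440 * t)
tone_b = np.sin(2 * np.pi * 440 * t)
tone_c = np.sin(2 * np.pi * 2000 * t)
print(spectral_similarity(tone_a, tone_b))  # near 1.0
print(spectral_similarity(tone_a, tone_c))  # much lower
```

This is exactly the limitation the paragraph points to: a duplicated clip is easy to detect at the feature level, while the same words spoken by a different voice would not be.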