A content sharing platform may receive millions of messages from users desiring to share media content such as audio, images, and video between user devices (e.g., mobile devices and personal computers). In some feature-rich multimodal social media platforms of this kind, images and videos may be first-class citizens, whereas text plays a supporting role. While this emphasis on visual media allows users to express themselves in new and exciting ways, the resulting textual sparsity becomes problematic when developing text-centric search functionality.