When humans converse with each other, they naturally combine information from different modalities, such as speech, gestures, and facial or head pose and expressions. With the proliferation of computerized devices, humans increasingly interact with displays associated with those devices. Spoken dialog systems, or conversational systems, enable human users to communicate with computing systems through various modes of communication, such as speech and/or gesture. Current conversational systems identify the intent of a user interacting with a conversational system based on these modes of communication. In some examples, conversational systems resolve referring expressions in user utterances by computing a similarity between the user's utterance and the lexical descriptions of items and associated text on a screen. In other examples, on-screen object identification is necessary to understand a user's intent because the utterance alone is ambiguous with respect to which on-screen object the user may be referring to. Accordingly, current techniques leverage multi-modal inputs, such as speech and gesture, to determine which on-screen objects a user is referring to.
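The similarity-based referring expression resolution described above can be sketched in a minimal form. The sketch below is illustrative only: it assumes a hypothetical `screen_items` mapping from item identifiers to their visible lexical descriptions, and uses a simple bag-of-words cosine similarity in place of whatever similarity measure a production system would employ.

```python
from collections import Counter
import math

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists as bag-of-words vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

def resolve_referring_expression(utterance, screen_items):
    """Return the id of the on-screen item whose lexical description
    is most similar to the user's utterance.

    `screen_items` is a hypothetical {item_id: description} mapping
    used here purely for illustration.
    """
    tokens = utterance.lower().split()
    scored = {
        item_id: cosine_similarity(tokens, desc.lower().split())
        for item_id, desc in screen_items.items()
    }
    return max(scored, key=scored.get)

# Hypothetical on-screen items and their visible text.
items = {
    "btn_play": "play the movie trailer",
    "btn_buy": "buy tickets now",
    "link_cast": "view full cast and crew",
}
print(resolve_referring_expression("buy two tickets", items))  # → btn_buy
```

A real system would add gesture evidence (e.g. a pointing location) as a second score and combine it with the lexical score, which is precisely the multi-modal fusion the passage describes.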