Autonomous vehicles are generally capable of sensing the environment and navigating without human input. As a requirement, autonomous vehicles need to be able to attend to and classify potentially moving objects in dynamic surroundings. However, the capability of tracking multiple objects within video sequences and predicting where the multiple objects are going to be located in the future remains a challenge. While existing efforts attained results in predicting trajectories of an object based on previous locations of the object, the models used tend to lack the capability to extract spatiotemporal feature dynamics from videos to enhance detections and improve trajectory predictions for object tracking.