There exist computer software products, which make it possible to decompose a text, build up semantic nets on its basis and this allows the objects, subjects, practices, conceptual hierarchy, execution sequence, qualitative, quantitative and timing performances to be determined out of the text. There exist products, which make it possible to build up manually a three-dimensional scene adding the models of characters, objects, models of buildings, to arrange cameras and add light sources, to animate a scene and finally obtain a three-dimensional animation video clip or static pictures. However, there are no scalable distributed systems, which would make it possible to automatically obtain a three-dimensional video clip or picture on the basis of a natural language text.