Computer image synthesis or generation refers to a process of generating realistic images of objects, landscapes, people, faces, and the like. For example, neural networks, such as deep generative neural networks, may be employed in image generation in various applications such as generating super-resolution images from low resolution images, image in-painting, text-to-image synthesis, attribute-to-face synthesis and the like. Such neural network systems may generate images by random sampling, or may generate conditional images that match specific conditions. Conditional image generation may find use in many diverse applications, such as in forensics applications where an image of a face of a suspect may be generated to match a description of the suspect; in education or research applications, where fine-grained images of, for example, birds may be generated to match specific descriptions of the birds; and so on. Current image generations systems, however, lack ability to generate sufficiently diverse image and/or are unable to produce images with sufficiently high fidelity. Moreover, current image generation systems are not controllable in that they lack ability to sample images by controlled change of factors such as posture, style, background, fine-grained details, and the like.