Hierarchical text-conditional image generation with CLIP latents