OpenAI recently released Point-E, an alternative method for generating 3D objects from text prompts. It takes less than two minutes on a single GPU, whereas other methods can take several hours. The new model builds on generative diffusion models such as GLIDE and Stable Diffusion, InfoQ reports.
The model pipeline starts by generating a synthetic view conditioned on the text prompt. A coarse 3D point cloud (1024 points) is then generated, conditioned on the synthetic view. Finally, a fine 3D point cloud (4096 points) is generated, conditioned on both the low-resolution point cloud and the synthetic view.
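The data flow of these three stages can be sketched as follows. This is only an illustration of how the stages feed into each other: the function names are hypothetical, and random numbers stand in for the actual diffusion models. The per-point layout (x, y, z, r, g, b) and the choice to keep the coarse points inside the fine cloud are assumptions of the sketch, not details stated here.

```python
import random

COARSE_POINTS = 1024   # size of the low-resolution cloud, from the article
FINE_POINTS = 4096     # size of the fine cloud, from the article


def text_to_view(prompt, size=64):
    """Placeholder for the text-conditioned image model: returns a blank RGB image."""
    return [[(0.0, 0.0, 0.0) for _ in range(size)] for _ in range(size)]


def _random_points(n, rng):
    # Each point is (x, y, z, r, g, b); random values stand in for model samples.
    return [tuple(rng.random() for _ in range(6)) for _ in range(n)]


def view_to_coarse_cloud(view, rng):
    """Placeholder for the point cloud model conditioned on the synthetic view."""
    return _random_points(COARSE_POINTS, rng)


def upsample_cloud(coarse, view, rng):
    """Placeholder for the upsampler, conditioned on the coarse cloud and the view.

    Assumption of this sketch: the coarse points are kept and new ones are added.
    """
    return list(coarse) + _random_points(FINE_POINTS - len(coarse), rng)


def generate(prompt, seed=0):
    """Run the three stages in order and return every intermediate result."""
    rng = random.Random(seed)
    view = text_to_view(prompt)
    coarse = view_to_coarse_cloud(view, rng)
    fine = upsample_cloud(coarse, view, rng)
    return view, coarse, fine
```

Calling `generate("a red chair")` returns the placeholder view together with a 1024-point and a 4096-point cloud, which mirrors the coarse-to-fine structure described above.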
After obtaining the higher-resolution 3D point cloud, the authors convert it into a textured mesh and render it in Blender. The conversion uses a regression model to predict the signed distance field (SDF) of an object from its point cloud, and the marching cubes algorithm is then applied to the resulting SDF to extract a mesh. Colors are assigned with a nearest-neighbor lookup: each mesh vertex takes the color of the nearest point in the original point cloud.
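The nearest-neighbor color assignment is simple enough to show on toy data. The sketch below is not the authors' code; it uses a brute-force search in plain Python, and the function names and sample values are made up for illustration.

```python
import math


def nearest_point_index(vertex, points):
    """Return the index of the point closest to `vertex` (brute-force search)."""
    return min(range(len(points)), key=lambda i: math.dist(vertex, points[i]))


def color_vertices(vertices, points, colors):
    """Give each mesh vertex the color of its nearest point in the cloud."""
    return [colors[nearest_point_index(v, points)] for v in vertices]


# Toy data: two colored points and three mesh vertices.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
colors = [(255, 0, 0), (0, 0, 255)]          # red, blue
vertices = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0), (0.4, 0.0, 0.0)]

vertex_colors = color_vertices(vertices, points, colors)
# vertex_colors == [(255, 0, 0), (0, 0, 255), (255, 0, 0)]
```

The brute-force search costs one distance computation per vertex-point pair. For clouds with thousands of points and meshes with many vertices, a spatial index such as a k-d tree (for example `scipy.spatial.cKDTree`) would be the usual replacement.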
Earlier this year, Google released DreamFusion, an expanded version of Dream Fields, a generative 3D system the company introduced in 2021. Comparing DreamFusion with Point-E on R-Precision, a semantic metric, the table above shows that DreamFusion performs better in this regard, i.e. it follows text prompts more closely and its generated 3D objects have higher fidelity. Point-E, however, is much faster at producing a 3D object in the form of a point cloud.
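R-Precision is not defined here. One common reading, assumed in the sketch below, is retrieval accuracy: for each generated object, a scoring model (CLIP in the usual setup) rates its rendering against a set of candidate prompts, and the object counts as a hit if the prompt it was generated from ranks first. The scores below are made up and do not come from any model.

```python
def r_precision(similarity):
    """Fraction of objects whose own prompt is the top-ranked candidate.

    similarity[i][j] is the score between rendered object i and prompt j;
    prompt i is the one object i was generated from.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(similarity)


# Made-up scores for three objects and three prompts (not real model output).
scores = [
    [0.9, 0.2, 0.1],   # object 0 retrieves prompt 0: hit
    [0.3, 0.4, 0.8],   # object 1 retrieves prompt 2: miss
    [0.1, 0.2, 0.7],   # object 2 retrieves prompt 2: hit
]
result = r_precision(scores)   # two hits out of three objects
```

Under this reading, a higher score means the generated objects are easier to match back to their prompts, which is the sense in which the metric is semantic.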
Point-E's main limitations are the low resolution and coarse texture of the 3D point clouds it produces. It also depends on synthetic renderings, a requirement that could be lifted by conditioning on real-world images. Finally, its semantic understanding of text prompts is not as good as that of other state-of-the-art 3D object generation models.