How should we optimize the text-CAD generation technology that Google and Nvidia are both involved in?

Article reprinted from: Yangz
By Reggie Raye
Source: The Gradient
Image source: Generated by Unbounded AI tool
The dust has yet to settle on AI-driven text-to-image generation, but one result is already clear: a flood of bad images. Sure, there are some high-quality images, but not enough to offset the loss in signal-to-noise ratio -- for every artist who benefits from a Midjourney-generated album cover, there are fifty people who will be fooled by a Midjourney-generated deepfake. In a world where a declining signal-to-noise ratio is the root of many ills (think scientific research, journalism, government accountability), this is not a good thing.
It’s now necessary to treat all images with a degree of skepticism. (Admittedly, this has long been the case, but the growing number of deepfakes demands ever more vigilance, which is cognitively taxing as well as unpleasant.) Constant suspicion -- or frequent deception -- seems a high price to pay for a digital gadget that no one asked for and that has so far brought few benefits. We can hope -- or, more aptly, pray -- that the cost-benefit ratio soon reaches a sane state.
But at the same time, we should note a new phenomenon in the field of generative AI: AI-driven text-to-CAD generation. The premise is similar to text-to-image programs, except that instead of returning an image, the program returns a 3D CAD model.
Ask the AI to give you an image of "Mona Lisa, but wearing Balenciaga" and it will convert it into 3D.
Here are some definitions. First, computer-aided design (CAD) refers to software tools that allow users to create digital models of physical objects, such as cups, cars, bridges, etc. (Models in the context of CAD have nothing to do with deep learning models; Toyota Camry ≠ recurrent neural network.) But CAD is important too; try to think of the last time you didn’t see an object designed in CAD.
Now that we have the definitions, let’s look at the big players that want to enter the world of text-to-CAD: Autodesk (CLIP-Forge), Google (DreamFusion), OpenAI (Point-E), and Nvidia (Magic3D). Here are some examples from each company:
The major players haven’t stopped startups from popping up at a rate of nearly one per month as of early 2023, with CSM and Sloyd perhaps the most promising.
Then there are some fantastic tools that can be called 2.5-D, because their output falls somewhere between 2-D and 3-D. The idea behind these tools is that users upload an image and the AI guesses how that image would look in three-dimensional space.
This Greedy Cup uses AI to turn SBF (Sam Bankman-Fried, portrayed as a wolf in sheep's clothing and a piper) into a relief (Credit: Reggie Raye/TOMO)
The open source animation and modeling platform Blender is undoubtedly the leader in this field, and the CAD modeling software Rhino now also has plugins such as SurfaceRelief and Ambrosinus Toolkit, which can generate 3D depth maps from ordinary images very well.
First of all, it should be said that all of this is exciting. As a CAD designer, I eagerly anticipate the potential benefits. Engineers, 3D printing enthusiasts, video game designers, and many others will also benefit.
However, text-to-CAD also has a number of drawbacks, many of them serious. Here is a brief list:
Opening the door to mass production of weapons and of racist or otherwise objectionable material
Triggering a wave of junk models, which in turn pollutes the model library
Violation of the rights of creators of copyrighted content
Regardless, text-to-CAD is coming, whether we like it or not. Thankfully, there are things technologists can do to improve these programs’ output and reduce their negative impact. We’ve identified three key areas where such programs can take things up a notch: dataset curation, usability pattern languages, and filtering.
To the best of our knowledge, these areas have been largely unexplored in the context of text-to-CAD. The idea of a usability pattern language will receive special attention, as it has the potential to significantly improve output. Notably, this potential is not limited to CAD; it could improve results in most areas of generative AI, such as text and images.
Dataset Curation
Passive Curation
While not all text-to-CAD approaches rely on a training set of 3D models (Google’s DreamFusion is an exception), curating a dataset of models is still the most common approach. Needless to say, the key here is to curate a good set of models to train on.
Doing this well involves three things. First, avoid the obvious sources of models: Thingiverse, Cults3D, MyMiniFactory. They host some high-quality models, but the vast majority are garbage. (The Reddit thread "Why is Thingiverse so bad?" illustrates the point.) Second, seek out extremely high-quality model repositories. (Scan the World is probably the best.)
Third, model sources can be weighted according to quality. An MFA graduate would likely jump at the chance to do such annotation work -- and, given the unfairness of the labor market, would be paid very little for it.
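One way to picture quality weighting is as weighted sampling when training batches are assembled. The sketch below is only an illustration under stated assumptions: the source names echo the repositories mentioned above, but the numeric weights, the `SOURCE_WEIGHTS` table, and the `sample_training_batch` helper are hypothetical and not taken from any existing text-to-CAD pipeline.

```python
import random

# Hypothetical quality weights per source (0 = never sample, 1 = best).
# The numbers are illustrative; in practice annotators would assign them.
SOURCE_WEIGHTS = {
    "scan_the_world": 1.0,
    "thingiverse": 0.1,
}


def sample_training_batch(models, batch_size, seed=None):
    """Draw (source, model_id) pairs with replacement, favoring high-quality sources.

    Models from sources missing in SOURCE_WEIGHTS get weight 0 and are never drawn.
    """
    rng = random.Random(seed)
    weights = [SOURCE_WEIGHTS.get(source, 0.0) for source, _ in models]
    if not any(weights):
        raise ValueError("no model comes from a source with a positive weight")
    return rng.choices(models, weights=weights, k=batch_size)


models = [
    ("scan_the_world", "bust_001"),
    ("unvetted_forum", "mystery_part"),  # unknown source, weight 0
    ("thingiverse", "phone_stand"),
]
batch = sample_training_batch(models, batch_size=1000, seed=0)
```

With these assumed weights, models from the higher-rated source dominate each batch while the lower-rated source still contributes occasionally, which is gentler than excluding it outright.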
Active Curation
Curation can and should play a more active role. Many museums, private collections and design companies are happy to have their industrial design collections 3D scanned. Moreover, in addition to generating a rich corpus, scanning can create a powerful record of our fragile culture.
The French were able to rebuild Notre Dame after the fire thanks to an American's 3D scanning technology. Photo credit: Andrew Tallon/Vassar College
Enriching the Data
In the process of creating a high-quality corpus, technologists must think carefully about what they want the data to do. At first glance, the primary use case might be “empower managers at hardware companies to move a few sliders, output the desired product blueprint, and then it can be manufactured.” However, if the history of mass customization failures is any indication, this approach is likely to fail.
We believe a more effective use case is to empower domain experts -- such as industrial designers at a product design company -- to prompt-engineer until they get the right output, then fine-tune and finalize it.
Such use cases require things that might not be obvious at first glance. For example, domain experts need to be able to upload images of reference products, as in Midjourney, and then label them according to their target attributes: style, material, dynamics, etc. It might be tempting to take a faceted approach here, where experts select style type, material type, and so on from drop-down menus. But experience shows that enriching the dataset in order to create attribute buckets is not advisable. The music streaming service Pandora took this manual approach, but was ultimately defeated by Spotify, which relied on neural networks.
Payoff
There is little work being done in the area of rigorous dataset curation (with a few exceptions), and we have much to gain from it. This should be a top priority for companies and entrepreneurs seeking a competitive advantage in the text-to-CAD war. A large, rich dataset is hard to build and hard to imitate, and it is the best kind of moat.
From a less corporate perspective, thoughtful dataset curation is the ideal way to drive the creation of beautiful products. To date, generative AI tools have reflected the priorities of their creators, but not necessarily taste. We should take a stand for the importance of beauty. We should care that what we bring into the world will fascinate users and stand the test of time. We should oppose the idea of mediocre products riding the wave of mediocrity.
If some people believe that beauty is not an end in itself, then perhaps they will be convinced by two data points: sustainability and profit.
The most iconic products of the past 100 years -- the Eames chair, the Leica camera, the Vespa scooter -- are treasured by those who use them. Vibrant communities of enthusiasts restore them, sell them, and continue to use them. Maybe their complex design required 20% more emissions than the competition at the time. That's okay. Their lifespans are measured in quarter-centuries rather than years, which means they consume less and emit less.
A 1963 Vespa GS 160 would cost $13,000 in 2023
As for profits, it’s no secret that beautiful products command a premium. The specs of an iPhone can never match those of a Samsung, yet Apple charges 25% more. The cute Fiat 500 subcompact doesn’t get the same gas mileage as an F-150. But that’s okay: Fiat bet correctly that yuppies are willing to pay $5,000 more for cuteness.
Usability Pattern Language
Overview
Pattern languages were first developed by the polymath Christopher Alexander in the 1970s. A pattern language is defined as a set of mutually reinforcing patterns, each describing a design problem and its solution. While Alexander's first pattern language was targeted at architectural design, the approach has since been applied successfully in many fields (most notably programming) and is at least as useful in the field of generative design.
In text-to-CAD, the pattern language consists of a series of patterns; for example, one pattern for kinematic parts, one pattern for hinges (a subset of kinematic parts, so one level of abstraction lower), and one pattern for friction hinges (one level lower still). The format of a friction hinge pattern is as follows:
Like natural languages, pattern languages consist of a vocabulary (a set of design solutions), a syntax (the location of solutions in the language), and a grammar (the rules by which patterns can solve problems). Note that the pattern “friction hinge” above is a node in a hierarchical network, which can be visualized as a directed network graph.
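To make the hierarchy concrete, here is a minimal sketch of how such patterns might be represented as linked nodes. It is an assumption-laden illustration rather than the article's actual pattern format: the `Pattern` class, its field names, the problem statements, and the parameter values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Pattern:
    """One node in a pattern hierarchy (hypothetical field names)."""
    name: str
    problem: str
    solution: Dict[str, float] = field(default_factory=dict)  # parameter -> value
    parent: Optional["Pattern"] = None  # next level of abstraction up

    def lineage(self) -> List[str]:
        """Return names from the most abstract ancestor down to this pattern."""
        names = []
        node: Optional[Pattern] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]


kinematic = Pattern("kinematic part", "Parts must move relative to one another.")
hinge = Pattern("hinge", "Two bodies must rotate about a shared axis.", parent=kinematic)
friction_hinge = Pattern(
    "friction hinge",
    "A lid must hold whatever angle it is left at.",
    solution={"torque_nm": 0.4, "width_mm": 12.0},  # illustrative values only
    parent=hinge,
)

print(" -> ".join(friction_hinge.lineage()))
# prints: kinematic part -> hinge -> friction hinge
```

Following the parent links from any node recovers the directed path through the hierarchy, which is the same structure a network graph of the pattern language would draw.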
These patterns embody the best practices in design fundamentals -- human factors, functionality, aesthetics, etc. As a result, the output of these patterns will be more usable, easier to understand (avoiding the black box problem), and easier to fine-tune.
The bottom line is, unless a text-to-CAD program takes into account the fundamentals of design, its output will be garbage. Doing nothing is better than having a text-to-CAD-generated laptop with a screen that won't stay upright.
Of all these essential elements, perhaps the most important and the most difficult to account for is designing for human factors. The human factors that must be considered to design a useful product are almost endless. Pinch points, finger entrapment, poorly placed sharp edges, ergonomic proportions, and more must be identified and designed out by AI.
In Practice
Let's look at a real-world example. Let's say Jane is an industrial designer at Studio ABC, which has been commissioned to design a futuristic gaming laptop. With current technology, Jane could use a CAD program like Fusion 360, enter Fusion's generative design workspace, and spend a week (or a month) working with her team to specify all the relevant constraints: loads, conditions, targets, material properties, and so on.
However, no matter how powerful Fusion’s generative design workspace is, it can’t get around one key fact: users must have a significant amount of domain expertise, CAD skills, and time.
A more pleasant user experience would be to simply prompt a text-to-CAD program until its output meets the user's requirements. Such a pattern-language-centric workflow might look like this:
Jane prompts her text-to-CAD program: "Show me some examples of futuristic gaming laptops, inspired by the shape of the TOMO laptop stand and the surface texture of the King Cobra."
Fully implementing text-to-CAD conversion will close the loop from image to manufacturable product.
The program outputs six concept drawings, each of which contains patterns such as "keyboard layout", "hinge structure" and "port layout of consumer electronics products".
Jane could respond by saying, "Give me some variations of image 2. Make the screen more recessed and the keyboard more textured."
Jane: "I like the third one, what are the parameters?"
The system lists 20 parameters -- length, width, display height, key density, etc. -- in the "Solution" field for the pattern it thinks is most relevant.
Jane notices that the hinge type is not specified, so she enters "Add hinge type parameters to the list and export CAD model."
She opens the model in Fusion 360 and is pleased to see that a proper friction hinge has been added. With the hinge parameterized, she increases the width parameter because she knows that Studio ABC's client wants the screen to withstand heavy use.
Jane continues to tweak it until she is completely satisfied with the form and function, so she can then hand it off to her colleague Joe, a mechanical engineer, who will review it and see which custom parts can be replaced with stock versions.
Finally, Studio ABC’s management is pleased that the laptop design process was reduced from an average of 6 months to 1 month. They are also pleased that, thanks to parametric technology, any modifications requested by the client can be accommodated quickly without the need for a redesign.
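The parametric hand-off in this workflow can be pictured as a small record of named parameters that is edited rather than redrawn. The sketch below is a hypothetical illustration: the `LaptopParams` class, its fields, and every numeric value are assumptions made for the example, not the output of any real text-to-CAD tool or of Fusion 360.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LaptopParams:
    """Hypothetical parameter record exported alongside a CAD model."""
    length_mm: float
    width_mm: float
    display_height_mm: float
    hinge_type: str = "unspecified"
    hinge_width_mm: float = 0.0

    def validate(self) -> None:
        """Raise ValueError if the parameters describe an impossible part."""
        if min(self.length_mm, self.width_mm, self.display_height_mm) <= 0:
            raise ValueError("dimensions must be positive")
        if self.hinge_type != "unspecified" and self.hinge_width_mm <= 0:
            raise ValueError("a specified hinge needs a positive width")


# Illustrative values only; none of these numbers come from the article.
concept = LaptopParams(length_mm=360.0, width_mm=250.0, display_height_mm=230.0)
# Jane asks for the hinge to be parameterized ...
with_hinge = replace(concept, hinge_type="friction", hinge_width_mm=12.0)
# ... then widens it for a client who expects heavy use.
sturdier = replace(with_hinge, hinge_width_mm=18.0)

for variant in (concept, with_hinge, sturdier):
    variant.validate()
```

Because each variant is a copy with one or two fields changed, a client's change request becomes a parameter edit that leaves earlier variants intact for comparison.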
Thorough Filtering
As AI ethicist Irene Solaiman pointed out in a recent interview, generative AI is in dire need of thorough safeguards. Even with a pattern language approach, generative AI itself is not immune to producing bad outputs. That’s where guardrails come in.
We need to be able to detect and reject requests for weapons, gore, child sexual abuse material (CSAM), and other objectionable content. Technologists who fear lawsuits might add copyrighted products to this list. If experience is any guide, objectionable requests will probably make up a large portion of queries.
Many of these requests will be fulfilled anyway once text-to-CAD models are open-sourced or leaked. (If the Defense Distributed saga has taught us anything, it’s that the genie never goes back in the bottle; thanks to a recent Texas ruling, an American can now legally download an AR-15, 3D print it, and then -- if he feels threatened -- use it to shoot someone.)
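The detect-and-reject step described above can be sketched, in its most naive form, as a keyword screen applied to each prompt before any model is generated. Everything here is a labeled assumption: the `BLOCKLIST` categories and terms, the `Verdict` type, and the `screen_prompt` function are hypothetical, and a real guardrail would likely combine trained classifiers with human review rather than rely on keywords.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical category -> keyword list, for illustration only.
BLOCKLIST = {
    "weapons": ["gun", "rifle", "ar-15", "silencer"],
    "gore": ["gore", "dismembered"],
}


class Verdict(NamedTuple):
    allowed: bool
    category: Optional[str]  # category that triggered the rejection, if any


def screen_prompt(prompt: str) -> Verdict:
    """Reject a prompt if any blocklisted term appears as a whole word."""
    text = prompt.lower()
    for category, terms in BLOCKLIST.items():
        for term in terms:
            # Lookarounds keep "gun" from matching inside words like "begun".
            if re.search(rf"(?<!\w){re.escape(term)}(?!\w)", text):
                return Verdict(False, category)
    return Verdict(True, None)
```

A keyword screen like this is easy to evade with paraphrase, so it is best read as a placeholder for where a stronger filter would sit in the request path.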
Additionally, we need widely shared performance benchmarks, similar to those that have emerged around LLMs. After all, if you can’t measure it, you can’t improve it.
____
In summary, the advent of AI-driven text-to-CAD generation presents both risks and opportunities, and the ratio of the two is still far from certain. The proliferation of low-quality CAD models and of toxic content are just two of the issues that require immediate attention.
There are also some neglected areas where technologists can usefully focus. Dataset curation is critical: we need to track down high-quality models from high-quality sources and explore other approaches, such as scanning industrial design collections. Usability pattern languages could provide a powerful framework for incorporating best design practices, and for generating CAD model parameters that can be fine-tuned until the model meets the requirements of its use. Finally, comprehensive filtering techniques must be developed to prevent the generation of dangerous content.
We hope that the insights presented in this article will help technologists avoid the pitfalls that have plagued generative AI to date and improve the capabilities of text-to-CAD to provide good models that will benefit the many people who will soon use them.