DALL·E roadmap new features

18 features we’d love to see in DALL·E 2 next

What features should OpenAI add to DALL·E 2 next? Here are 18 ideas from the user community.

Preamble

OpenAI has a fine track record for advanced AI research and discoveries.

But commercialising DALL·E requires a refocus from the rarefied culture of bleeding-edge intellectual inquiry, to the pedestrian concerns of the modern SaaS business. Customer service! UX! Responsive browser design! Ugh – it’s not exactly transformative AGI, is it? But with a million users hungry to try DALL·E out – yet all liable to churn out if they don’t find a reason to keep using it – OpenAI might consider making it a little more user-friendly.

When OpenAI launched their text AI, GPT-3, they essentially outsourced the ‘user experience’ bit. They made an API available to developers, who then used the model to power more consumer-friendly tools like AI copywriters and chatbots. (This also meant OpenAI only had to deal directly with a few commercial users!)

This time around, hoping to keep their image generator (and its myriad potential misuses) on a tighter leash, OpenAI seem determined to run the whole DALL·E show in-house.

Despite the enormous ‘wow!’ factor, success is not yet guaranteed…

💰 With each prompt costing $0.13, users will now be much less tolerant of misfires and errors (even self-inflicted ones)

⏱ With only 50 free ‘trial’ credits, new users don’t have infinite time to learn what DALL·E can do – they need to be confident (and ready to open their wallets!) when time runs out

⚠️ Google (Imagen, Parti) and Meta (Make-A-Scene) already have rival models waiting in the wings – they also have enormous audiences and massive institutional focus on commercialising consumer tech

🏃🏻‍♂️ On the independent side, tools like MidJourney and Craiyon are making impressive leaps forward every day, and new entrants like Stability.ai’s #StableDiffusion are joining the fray

🤷🏻‍♂️ Although DALL·E is a huge technical accomplishment and enormous fun to play with, the commercial and business use-cases that drive consistent use are not necessarily smack-you-in-the-face obvious – it’s an entirely new kind of tool with some considerable limitations

✍️ The GPT-3 example is instructive: it’s very clever and engaging to play with, but few people would currently trust AI to write anything mission-critical, or pay lots of money to natter on about nothing in particular. (It is, however, being heavily used to generate spam content!)

With all that in mind, we’re likely to see some changes to the DALL·E experience coming soon – if OpenAI choose to make it a priority, that is. Here are 18 things users like you and me might hope to see in the near-and-far future.

Onboarding & user empowerment features

DALL·E is really just an empty text box. And although it’s very powerful, it’s totally inscrutable.

Scroll down, and you’ll find a mere 42 handpicked examples of previous outputs. (And once you’ve used up your free credits, it literally costs money if you click on one!)

There are very few ‘affordances’ – the kind of discoverable buttons, sliders and UI elements that help the user understand what’s possible.

Good luck! You have 50 credits to figure this out.

So the onboarding experience is icy, to put it mildly. Type something in, or don’t. Doesn’t work? Try again, bitch!

Conversely, most apps that are designed to help non-artists make cool, useful images are colourful, vibrant, inspiring, and full of discoverable templates, examples, and ways to get started quickly.

DALL·E is so intimidating in its simplicity, I felt compelled to write a whole (albeit small!) guide to how to use it. And people seemed to really enjoy it. Perhaps too much, because if you’re relying on internet superfans to write explanatory material, and your users are crying out for it, then your product might just be a little on the user-unfriendly side. How might it be enhanced, quicksmart?

1. A searchable public gallery

Previously, the best way to learn DALL·E was simply to play around with it. And it was a lot of fun!

But now each attempt costs $0.13, users simply can’t afford to experiment endlessly.

Luckily, people can learn quickly by copying others and pooling their knowledge. This gives DALL·E a viral growth loop: the more people use it, the more prompt hacks are discovered, and (so long as those hacks are easy to find) each one unlocks new uses of the product, which attracts more people to use it.

A DALL·E2 pinboard on Pinterest.

A basic example of this loop is Notion or Miro: as they get more users, more people publish templates for increasingly niche use cases, which attracts new kinds of users. The companies don’t have to create all the templates themselves, they just have to do a bit of curation.

And because DALL·E is powered by simple human-readable text strings, it’s actually really easy for a new user to copy, edit and adapt the prompt of another user. There’s no complex walkthrough required – just use a similar sentence! Alas, it’s still really hard to see what else is out there – DALL·E posts are simply scattered across social media, usually without the exact prompt attached.

Stranger still, even the generations that users have actively chosen to ‘Share’ / ‘Make public’ aren’t collected anywhere. Out of tens of millions of images generated in the first three months, fewer than 50 are displayed on the homepage. (And on mobile you can’t even read the relevant prompts!)

This forces users to consult fan collections and projects like my DALL-E Prompt Book (or this INCREDIBLE site of styles) that can, for practical purposes, only offer one or two examples per ‘prompt’. Or these two hacked-together search engines, which I’m pretty sure haven’t been updated in quite some time.

the dalle2.app search site, perhaps in stasis

Conversely, over on MidJourney, you can type in any phrase you can think of, and see dozens and dozens of previous generations that used that expression.

MidJourney search for prompts with ‘picasso’ in.

Over time, there’s all kinds of additional functionality you can layer on top of a gallery like this: social functionality like creator profiles and collections, editorial slots like weekly spotlights and challenges, and data-driven enhancements like tag clouds, trending topics and smart result ordering.

But for now, even a dead-simple searchable library would be a game changer. (My only theory as to why it doesn’t exist: OpenAI is concerned the tool could be used to quickly find ‘controversial’ or IP-heavy images by wily journalists or lawyers?)
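And the engineering really is dead-simple. Here’s a bare-bones sketch of prompt search using SQLite’s built-in full-text indexing – the table layout and example rows are mine, purely illustrative:

```python
# A bare-bones prompt gallery search using SQLite's built-in FTS5 full-text
# index. Table layout and example rows are purely illustrative.
import sqlite3

conn = sqlite3.connect("gallery.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS prompts USING fts5(prompt, image_url)")

# Index some publicly shared generations (in reality: every 'Make public' image).
conn.executemany(
    "INSERT INTO prompts VALUES (?, ?)",
    [
        ("portrait of a cat in the style of Picasso, oil on canvas", "https://example.com/1.png"),
        ("a bowl of soup as a planet in the universe, digital art", "https://example.com/2.png"),
    ],
)

# Any phrase you can think of -> every public generation that used it.
for prompt, url in conn.execute(
    "SELECT prompt, image_url FROM prompts WHERE prompts MATCH ? ORDER BY rank", ("picasso",)
):
    print(prompt, "->", url)
```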

2. Features that help people build prompts

It’s great that DALL·E generates so many types of image – but how many art styles does the average user actually know?

So it would make sense to offer templates, controls, or wizards – some visual tools, basically – to reveal some tried-and-tested ‘looks’. Like, y’know, every other app or gallery-type thing.

StarryAI’s style picker.

Of course, as users mature, they’ll pride themselves on creating entirely undreamt-of pieces using handcrafted text prompts alone. But we only have 50 prompts to get a newbie user desperate to buy more, so it’s wise to give them SOME way to kick ass straight out of the blocks.

Better still, this keeps it easy for users to ‘graduate’ to crafting their own prompts. Far from trapping users in the ‘beginner experience’, ‘clicking the colourful buttons’ still ends up crafting a regular text prompt, so the user can easily see and understand the workings of the process:

Click buttons. Get sentence. Now I can see how the buttons create a sentence. In future, I can just write my own sentences!
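As a sketch of how thin that layer could be – the picker labels and prompt fragments below are invented for illustration, not official DALL·E modifiers:

```python
# Each picker maps a friendly button label to a prompt fragment; the output is
# always a plain, visible sentence the user can later write by hand.
STYLES = {"Watercolour": "a watercolour painting of", "Photo": "a photograph of"}
MOODS = {"Cosy": "warm, soft lighting", "Dramatic": "dramatic chiaroscuro lighting"}

def build_prompt(subject: str, style: str, mood: str) -> str:
    """Compose the button choices into an ordinary text prompt."""
    return f"{STYLES[style]} {subject}, {MOODS[mood]}"

print(build_prompt("a lighthouse at dusk", "Watercolour", "Cosy"))
# -> 'a watercolour painting of a lighthouse at dusk, warm, soft lighting'
```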

Other AI tools (including GPT-3) offer a bunch of ‘tech-y power options’ – entering seed numbers, choosing models, setting ‘temperature’ etc – that are pretty mysterious to the newcomer.

But rolling up some common prompt formats into templates or presets makes lots of sense!

Pitch, Squarespace, Canva: whatever the creative tool, they all come with some initial starter templates so users can create something nice quickly, and get a grip on ‘what it’s all about.’

DALL·E offers a massive and immediate ‘WOW moment’, almost regardless of what you first type in, but the ‘value moment’, where people go ‘OK, yes, I can see why I’d use this!’ needs a little bit of TLC.

Fixing some obvious things

The AI engine itself is still rather extraordinary. But here are some common requests from the community.

3. A website that works a bit more reliably

Let’s be honest: for a webpage that shows a simple text box and a grid of pictures, DALL·E 2 is just a bit too glitchy, especially on mobile, where images can appear weirdly overlaid, the prompt box becomes uneditable, and recent generation history isn’t even visible.

All the tiny thumbnails in the sidebar are secretly full-size images, which means scrolling the whole way down loads about 60MB of data. Maybe that’s why opening DALL·E gets my laptop fans spinning like crazy?
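The obvious fix is to serve real thumbnails. A minimal sketch with Pillow – the sizes are illustrative:

```python
# Serving real thumbnails instead of full-size PNGs.
from PIL import Image

img = Image.open("generation.png").convert("RGB")  # 1024 x 1024, ~2 MB as PNG
img.thumbnail((128, 128))                          # shrink in place, keeps aspect ratio
img.save("generation_thumb.jpg", quality=80)       # a few KB instead of megabytes
```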

A bit of TLC here seems like a good place to start.

4. Built-in uncropping, outpainting, and expanding!

From cosmopolitan.com: ‘The World’s Smartest Artificial Intelligence Just Made Its First Magazine Cover’

Thanks to Cosmo, we know this one’s on the way! Today, if you want to uncrop an image to reveal its surroundings, or create a wider view, you need to download your image, upload it to a tool like Photopea, resize it, and upload it back into DALL·E: all a bit of a pain.

Compared to manifesting the universe on demand, adding blank pixels and resizing canvases isn’t the most technically demanding feature, plus it’s a popular use case. So expect this to appear in DALL·E sooner rather than later.
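In the meantime, the manual canvas-padding step can at least be scripted. A sketch with Pillow – the dimensions are arbitrary, and the transparent margin is what DALL·E’s in-painting then fills:

```python
# Scripting the manual 'uncrop' prep: paste the square output onto a larger
# transparent canvas, ready for DALL·E to in-paint the empty margins.
from PIL import Image

src = Image.open("generation.png").convert("RGBA")      # 1024 x 1024 original
canvas = Image.new("RGBA", (1536, 1024), (0, 0, 0, 0))  # wider, fully transparent
canvas.paste(src, ((canvas.width - src.width) // 2, 0)) # centre the original
canvas.save("uncrop_me.png")  # upload, then in-paint the transparent edges
```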

5. Landscape & portrait images

It seems simple enough – other generators like MidJourney and Nightcafe can generate these easily! In fact, the conspicuous absence of this feature makes me suspect it might have some obscure technical basis.

At the moment, non-square images are achievable through stitching, but as each ‘stitch’ requires an additional credit, that’s not likely to be a popular approach going forward. (You also start to get style drift across the image.)

Offering a wider range of image dimensions would also make DALL·E viable in more contexts: for instance Instagram Stories (portrait) or social media banners (panorama).

A portrait image generated in Midjourney.

6. Upscaling to get higher-resolution images

Even more down-to-earth, this one – but as images are now available for commercial use, a heftier resolution than 1024 x 1024 might be welcome for print purposes.

Image upscaled 4x by bigjpg

AI upscalers already exist, but it would be convenient to bring this step to the product itself. An existing algorithm could be adapted – or perhaps DALL·E, with its ‘knowledge’ of the micro-details from high-res training data, could do an even better job than the status quo alternative.
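For reference, here’s the non-AI baseline any built-in upscaler would need to beat – plain Lanczos resampling in Pillow:

```python
# Classical (non-AI) Lanczos resampling: sharp-ish, but it can't invent detail
# the way a dedicated neural upscaler can.
from PIL import Image

img = Image.open("generation.png")
big = img.resize((img.width * 4, img.height * 4), Image.LANCZOS)  # 1024 -> 4096
big.save("generation_4x.png")
```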

7. New kinds of image formats

A DALL·E photo, animated, and turned into a GIF

For ease of sharing, DALL·E could also offer additional formats for image downloads.

As a site owner, a simple JPG option with a smaller file size would be appreciated!

More ambitiously, there could be slideshow GIFs of multiple outputs, animated videos with various visual effects applied, and other neat stuff that adds some pizzazz to a generated image.
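The slideshow GIF, at least, is nearly a one-liner with today’s tools. A sketch with Pillow – the filenames are invented:

```python
# Stitching a batch of outputs into one animated slideshow GIF.
from PIL import Image

frames = [Image.open(f"output_{i}.png") for i in range(4)]
frames[0].save(
    "slideshow.gif",
    save_all=True,
    append_images=frames[1:],
    duration=800,  # milliseconds per frame
    loop=0,        # 0 = loop forever
)
```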

8. A better personal archive that’s impossible to lose by accident

DALL·E currently shows results for your most recent 50 generations in the sidebar. And, if you so choose, you can manually save individual images (one at a time, mind you!) to ‘your collection’, which is a super-basic, totally unsearchable bucket, where you can keep up to 10,000 images.

The ‘collection’ page in DALL·E.

But manually considering ‘what to save’ is secretly pointless – if you bookmark a link to a generation screen and visit it weeks later, all the images are still right there on the back-end!

There’s just no existing way to get to them within the product, other than via your browser history.

In Chrome, go to chrome://history, and search ‘labs.openai [TERM]’ for an old generation

A glorious, ten-version output from the early days of DALL·E’s test phase – all the images from two months ago still load!

Currently, this means so much work can simply be… lost? Forever? (Especially if, let’s say, you use multiple devices.) That was a shame even in the days of limitless free prompts, but now that generations cost real money to produce, it shouldn’t be possible for users to simply lose track of them.

So: DALL·E ought to make all your past work (which, remember, it is already storing!) retrievable by default, perhaps up to some generous limit of 100,000 images / six months of work (or an option to pay for more storage), with ample warning to download anything that’s on the cusp of deletion.

This personal vault should, of course, be searchable by text prompt. Meanwhile, perhaps the current ‘save’ function could be replaced by a ‘favourite’ option that simply makes it extra-visible in one’s archive.

Finally, it should be possible to create subfolders within one’s collection. As DALL·E is used with more focused intent, work will become more project-based, so it should be possible to group generations together appropriately.

9. Get DALL·E to remember the faces it just made up

DALL·E doesn’t allow uploads of human faces, for safety reasons, but annoyingly, this also includes fictional faces that DALL·E created mere moments previously.

Not a real dude! You made him up like 20 seconds ago!

Perhaps DALL·E could match the face-filled upload to a fingerprint of previously generated images and wave it through, so those generations can also be manipulated at a later point.
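One plausible mechanism – my speculation, not anything OpenAI has described – is perceptual hashing. A sketch using the imagehash library:

```python
# Speculative 'fingerprint' check with perceptual hashing: if an uploaded face
# matches a hash we generated earlier, it's our own invention - wave it through.
from PIL import Image
import imagehash

uploaded = imagehash.phash(Image.open("upload.png"))
previously_generated = [imagehash.phash(Image.open(p)) for p in ("gen_1.png", "gen_2.png")]

# Subtracting two hashes gives a Hamming distance; <= 5 bits on a 64-bit pHash
# usually means 'same image, minor edits'. The threshold here is a guess.
if any(uploaded - h <= 5 for h in previously_generated):
    print("This face was generated here moments ago - allow the upload.")
```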

Note that built-in uncropping, and a larger archive of stored work, may drastically reduce the need for this.

10. Smarter composition

A frequent complaint is that certain image types – say, a painting of a figure standing up – are cropped too tightly on the first try, so their head and feet are missing. A controllable bias towards ‘zooming out’ and revealing entire figures, whether achieved through the AI engine, reliable prompt engineering, or image templates like ‘frames’, would help alleviate this annoyance. (The portrait-shaped output would also likely help here.)

11. Face fixer

Weird eyes, distorted mouths and other facial glitches are still commonplace in DALL·E, especially for scenes with multiple characters. And because of a deeply human ‘ick’ around weird facial stuff, these artefacts often make an image uncomfortable to use, even for casual internal purposes.

Left: a DALL·E original output. Right: the same image, processed in ARC’s face fixer.

DALL·E could (optionally) layer on another AI that’s specifically focused on making faces un-weird, either during the initial prompt, or as an additional tool that can be deployed against targeted image areas. Something that lets the user clearly say: ‘this area is meant to depict a normal face. make it more normal and face-y.’

Current best practice seems to involve in-painting faces with very strident, insistent prompts like ‘detailed face, very good face, great facial expression, photorealistic portrait photography, superb face, it’s a face!!! great face’. Not totally reliable.

Basically, Tencent’s face restorer.
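For the curious, a rough sketch of running that restorer (the open-source GFPGAN) over an output – the library is real, but check its README for the current API before copying this:

```python
# Rough sketch with the open-source GFPGAN restorer. Requires downloaded model
# weights; argument names follow the project's README at the time of writing.
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(model_path="GFPGANv1.3.pth", upscale=1)
img = cv2.imread("dalle_output.png")  # BGR array, as OpenCV loads images

# paste_back=True re-composites the restored faces into the full image.
_, _, restored = restorer.enhance(img, paste_back=True)
cv2.imwrite("dalle_output_fixed.png", restored)
```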

12. Image-to-text

Sometimes we discover an image, or series of images, with a certain style, but we don’t know how to put it into words – or, even if we do know how, we can’t be confident that DALL·E would use the exact same terms.

What would you call these types of drawings? via Same Energy

As luck would have it, DALL·E is built on an image–text matching model called CLIP, which scores how well a caption describes an image – DALL·E essentially runs it in reverse. If there were a way for users to show an image to CLIP, and explore which adjectives the engine most strongly associates with it, we’d learn how to describe the images we want even more easily.

As an interesting example, OpenAI now offers a ‘text similarity’ model that can detect how ‘alike’ different written terms are. Here’s a visual representation:

One could imagine a similar concept for comparing visual descriptors: how similar does DALL·E find ‘Annie Leibovitz’ to ‘van Gogh’ to ‘papier mache’ to ‘unsettling’ to ‘realistic’ to ‘swirling’? Which concepts correlate, are anti-correlated, or totally independent?
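CLIP itself is open-source, so you can approximate this experiment today. A sketch comparing descriptor embeddings by cosine similarity:

```python
# Comparing visual descriptors via OpenAI's open-sourced CLIP text encoder:
# cosine similarity of the embeddings approximates how 'alike' the model
# considers two terms. (pip install git+https://github.com/openai/CLIP.git)
import clip
import torch

model, _ = clip.load("ViT-B/32", device="cpu")
terms = ["Annie Leibovitz", "van Gogh", "papier mache", "unsettling", "swirling"]

with torch.no_grad():
    emb = model.encode_text(clip.tokenize(terms)).float()
    emb /= emb.norm(dim=-1, keepdim=True)  # unit-normalise each embedding

print(emb @ emb.T)  # pairwise cosine similarities, e.g. 'van Gogh' vs 'swirling'
```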

13. Text diffs & steerable variations

Currently, users can upload an image and create variations, but we can’t control how it varies – there is no accompanying text prompt, so DALL·E just plays Chinese Whispers with itself to generate similar but not identical ‘sibling’ pictures.

But we know more is possible…

What if we could control variations by saying: ‘keep this, but change this?’

For instance:

‘the cat from this image, but as a watercolor painting.’
‘this beach scene, but at night’
‘this style of painting, but depicting a happy robot rather than a gloomy Dutch peasant.’

This would be incredibly powerful – and indeed it seems possible. But a few things stand in the way:

  • Revealing biases – diffs that make a character ‘richer’, ‘more beautiful’, ‘more trustworthy’, ‘smarter-looking’ and so on may reveal some uncomfortable biases in the training data
  • Harassment – in the other direction, this tool would make it easy to take a target image and make it ‘worse’ in an unflattering way, although the current block on face uploads might help mitigate this.
  • Compute – each step in the video above is presumably another image generated, so a similar output seems quite intense computationally (although of course we might be happy making a big leap in [direction x] without documenting the journey via 100 tiny micro-edits)

14. Prompt queuing, batch generations

Just whenever you get a second. No worries if not!

Sometimes, once we’ve developed a style, we simply want to apply it to All the Things.

Perhaps in the pay-to-prompt era, we’ll be a little less greedy, but you can easily imagine commercial or hobbyist projects where artists know, well in advance, exactly what they want their 115 generations to be.

So, pro tools that allow the user to queue up a series of prompts, and even automate mass downloading once complete, would be welcome for AI art generation at light industrial scale. And as these tools will only help users spend their money faster, it might happen sooner rather than later….
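Power users could fake this client-side today, if there were an API to call. A sketch of the queue loop – generate() is a hypothetical stand-in, since no public DALL·E endpoint exists yet:

```python
# Client-side prompt queue. generate() is a hypothetical stand-in: swap in the
# real call if and when a public DALL·E API appears.
import time

SUBJECTS = ["a lighthouse", "a windmill", "a tram depot"]
HOUSE_STYLE = ", isometric low-poly 3D render"  # applied to All the Things

def generate(prompt: str) -> str:
    """Hypothetical endpoint: returns a URL to the finished image."""
    return f"https://example.com/{abs(hash(prompt))}.png"

for subject in SUBJECTS:                   # queue the whole batch up front...
    url = generate(subject + HOUSE_STYLE)  # ...one credit per prompt
    print(subject, "->", url)
    time.sleep(1)                          # stay polite about rate limits
```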

15. Style savers

Saving a style in Snapseed.

In a visual world with infinite possibilities, many users will instead seek out consistency: artists looking to develop and iterate on a recognisable style, and brands and organisations who want to create their own visual universe, connected by a common thread.

So tools that enable users to ‘lock in’ particular looks for re-use could be in high demand, just as we save Lightroom presets, or official brand colours in Canva.

Can DALL·E provide extra control beyond ‘just using very similar prompts each time’?

A more complex idea would be for DALL·E to accept a bank of reference images, or brand moodboard, that can then be used as an aesthetic basis for future creations by that user.
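Even without model-level support, a ‘style saver’ could be as simple as named prompt suffixes. A hypothetical sketch:

```python
# A hypothetical 'style saver': named prompt suffixes persisted to disk, so a
# locked-in look can be reapplied to any subject later.
import json
from pathlib import Path

PRESETS = Path("styles.json")

def save_style(name: str, suffix: str) -> None:
    styles = json.loads(PRESETS.read_text()) if PRESETS.exists() else {}
    styles[name] = suffix
    PRESETS.write_text(json.dumps(styles, indent=2))

def apply_style(subject: str, name: str) -> str:
    return f"{subject}, {json.loads(PRESETS.read_text())[name]}"

save_style("house-style", "flat vector illustration, teal and coral palette")
print(apply_style("a delivery drone over the city", "house-style"))
# -> 'a delivery drone over the city, flat vector illustration, teal and coral palette'
```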

16. Model transparency: candor, updates and consultation

There’s nothing wrong with updating DALL·E iteratively – indeed, that’s the only way it will improve! But committed users, who are using DALL·E as a core part of their practice, deserve a bit of insight into what, roughly, is going on.

Does it change every day? Every week? Big updates every so often? What impact do these changes have? Are they being tested? On what kinds of images? How?

Already, conspiracy theories abound that DALL·E is ‘not as good as it was’ or is now failing at some previously accomplished result. (One prominent creator, celebrated on OpenAI’s own website, laments that it’s no longer possible to create his ‘signature’ look in DALL·E.) As I write, OpenAI is denying they made any recent changes to the model.

But in June, they revealed they’d made an unannounced upgrade two weeks earlier, ‘enhancing’ DALL·E’s ability to render faces. It’s this kind of thing that makes people wonder.

Content creators are used to living at the whim of changing algorithms, but generally this is in the context of ‘free publicity’ like the Instagram algorithm or Google search results. As DALL·E users are paying for their prompts, and using it like paint or camera film, notification that the formula has shifted and results might change seems reasonable. People don’t suddenly find that Premiere can’t do the Ken Burns effect anymore.

It seems plausible that certain overall ‘optimisations’, say, to improve DALL-E’s photorealism, might also have downsides in a niche, such as reducing DALL-E’s likelihood of generating ‘the fantastical.’ By transparently announcing model changes, just like release notes for apps or games, OpenAI will build trust, and give creators an opportunity to tweak prompts, compare results, deliver feedback, and adapt their practice, rather than blindly repeating their usual processes, wasting money on something which has inexplicably stopped working.

In a perfect world, earlier model versions would still be available for paying users, so artists who’ve developed a consistent style can continue to achieve the same results. Or vice versa: model updates could initially be released as ‘beta versions’, so enthusiasts could test-drive the next iteration in advance, and flag issues before it’s rolled out more widely.

‘A total miss’, from Discord

Transparency around how the AI works moment-to-moment would also be welcome – are images less accurate when the server is under heavy load? If users are getting ‘nonsense’ results, should they try again later? Does OpenAI track what percentage of outputs are ‘bad’? How does that fluctuate over time? Can users recover their credits for poor outputs?

This kind of detail adds context and helps people understand what they’re seeing and why – which in turn generates trust, forgiveness and community.

17. Let users turn the de-biaser off

OpenAI recently revealed a laudable initiative to counteract racial and gender biases in DALL·E’s generations, i.e. the tendency for ‘a CEO’ to generate all white men.

But far from an elegant retraining of the DALL·E algorithm, the prompt box is simply scanning prompts for ‘job-like’ words like ‘accountant’, and then, if no race or gender is specified, invisibly adding random terms to the prompt, like so:

‘A smiling carpenter on a sunny day, Japanese man’
‘A smiling carpenter on a sunny day, Caucasian woman’
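To be clear, this is the community’s reverse-engineered read of the mechanism, not a confirmed implementation – but as a sketch, it amounts to something like:

```python
# The community's reverse-engineered read of the de-biaser (not confirmed by
# OpenAI): scan for job-like words, and if no demographic term is present,
# silently append a random one. Word lists here are invented for illustration.
import random

JOB_WORDS = {"carpenter", "accountant", "ceo", "dj", "nurse"}
DEMOGRAPHIC_WORDS = {"man", "woman", "male", "female", "black", "white", "asian"}
SUFFIXES = ["Japanese man", "Caucasian woman", "Black woman", "Hispanic man"]

def debias(prompt: str) -> str:
    words = set(prompt.lower().replace(",", " ").split())
    if words & JOB_WORDS and not words & DEMOGRAPHIC_WORDS:
        return f"{prompt}, {random.choice(SUFFIXES)}"
    return prompt

print(debias("A smiling carpenter on a sunny day"))
# -> e.g. 'A smiling carpenter on a sunny day, Japanese man'
```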

You can trick DALL·E into revealing this by writing a test prompt like ‘A person wearing a t-shirt that says’ – it’s then easy to see that DALL·E has appended the words ‘male’ and ‘female’ to some of the text prompts:

MALLE and FAMLE. Ah, the two genders

On the whole, this gets the job done, but it can have unexpected impacts.

For instance, trying to prompt an animal to do a job (‘A snail that’s a radio DJ‘) triggers the behaviour, randomly summoning a Black woman.

from the official DALL·E discord

Similarly, the prompt below, which calls for ‘a human face… made from mechanical tools’, now gets a random ‘demographic person’ attached in the second output:

via the r/Dalle2 subreddit

So, while we’re all for representation, there are occasions where it’s not called for – namely, when we aren’t trying to represent a human at all. As customers are paying per prompt, they ought to have the simple option to toggle this setting off when necessary.

In any case, who’s to say what other ‘invisible’ prompt-overrides OpenAI might choose to add in the future? By giving power users the option to block these sneaky, surreptitious suffixes, OpenAI actually give themselves more room to test them – without making them mandatory.

A final benefit to this: the de-biaser is being used as a scapegoat for all kinds of failed generations or poorly-defined prompts: people see a certain type of person appear unexpectedly and claim that it ‘broke the system’. (There is a certain reflexive ‘anti-wokeness’ at play here!) So letting people turn off the setting will help them detect not only when the de-biaser is causing problems, but also when it isn’t.

18. More solid legal footing

Although I naturally don’t believe DALL·E is ‘a scam’ or ‘only capable of infringement’, it’s fair to say we are in somewhat uncharted legal waters. And although I expect OpenAI has a solid theory as to why they’re in the clear, I’m sure they anticipate that this theory will inevitably be challenged, with a non-zero chance it’s rejected.

Perhaps as a result, their training set and generation process are shrouded in mystery. Commercial users can’t be clear on the provenance of any given image; photographs of ‘unreal people’ may in fact resemble real people, who then claim the photo is ‘actually of them’ – and this time there’s no model release form from a doppelgänger human to prove it isn’t.

Proving ‘a connection’ between an original work and the allegedly infringing work is usually important to prosecuting a case; with 650 million images used as part of the training process, it’s a lot harder for an end user to be confident that no such connection exists (much harder than Ed Sheeran credibly claiming he’s never heard an obscure track on YouTube, for example).

And there’s more: US law currently asserts that no copyright exists in AI generated artwork. OpenAI claims to have it anyway. (So would I, tbh.) But: they then give the user a license to use the work commercially! It’s all a bit confusing. (Midjourney gives the copyright (that may or may not exist) to the user, then reserves a non-exclusive right to use the image themselves.)

Similarly, ‘permitting commercial use’ is outright misleading (and could get innocent civilians into a lot of trouble!) if OpenAI doesn’t make it clear that other legal phenomena exist (like copyright and trademark rights in fictional characters) that would preclude commercial use despite OpenAI’s blessing.

Steps that could help build confidence and understanding around the legal issues: a more detailed, user-friendly ToS; guarantees (or transparent tools to verify) that any given output is not a close or ‘obviously derivative’ match for a specific work in the training set; or indemnity agreements for corporate and enterprise users.

Does OpenAI actually want it badly enough, though?

OpenAI is a leading research organisation with a $1bn warchest, attention-grabbing results and a deep mission to build ‘artificial general intelligence.’ That’s an endgame a lot more ambitious than DALL·E – they aren’t setting out ‘merely’ to democratise art, or become the new Adobe, Getty, or Canva; they’re aiming to create adaptable human-equivalent tools that can accomplish any task.

(Consider that OpenAI’s public products are likely many months behind the state-of-the-art research internally. And although DALL·E’s pictures are amazing, what about an AI assistant that could do your taxes, research topics and write summary reports, book your flights, chase up outstanding work or reach out to contacts – the sort of daily drudge work that a billion knowledge workers would kill to automate? An AI admin assistant, rather than an AI concept artist, might be way less interesting, but way more valuable.)

Reviewing OpenAI’s current roles and open vacancies on LinkedIn, they don’t strike me as a company MASSIVELY gearing up to serve a million customers, briskly recruiting the kind of front-end, customer-oriented product roles you’d expect for a major consumer/prosumer SaaS.

It’s at least possible that OpenAI have accidentally built a killer tech demo, one real people happen to be desperate to use, and they’re just kind of running with it. But DALL·E 2 is ultimately not the point of this company – they want a service that can accomplish any task a reasonably smart human with internet access can do, but 100x faster. In which case, would you rather have more hardcore AI researchers to get you there, or more marketing/social/product people who are like, A/B testing onboarding email drip sequences and running little paid social campaigns to reduce churn by X% MoM? (No offence to the latter group – I’m literally one of them!)

That’s why OpenAI might feel that devoting resources to DALL·E 2 is just a distraction from their overall strategy. In contrast, companies that are primarily focused on art generation as a business (like Midjourney), are already in the artist/creator space (like Apple and Adobe) or have billions of existing users plus the ops/marketing support to bring it to a mass market (like Google and Meta) might focus on developing these features a lot more aggressively, and ultimately grab the #1 spot in the field.

Maybe it’s all desire paths

There is a product development strategy of starting with a minimum viable product, seeing how people use it, and then adding features based on user behaviour. (The classic analogy is a field of grass in a public space: wait to see where the grass is flattened by repeated footsteps, then lay the asphalt paths along those routes.)

So the current lack of certain features (especially the most obvious ones!) may simply reflect a desire to first see what users, well, desire.

Hopefully the paths in the grass are starting to emerge. It’s time to put down some slabs.
