
What Is ChatGPT Vision? 7 Ways People Are Using This Wild New Feature

2023-10-06 12:00

ChatGPT can now read and respond to image prompts, and in contrast to the doom and gloom that normally comes alongside news of AI getting more powerful, this new capability seems to have captured the interest of AI users.

OpenAI calls this feature GPT-4 with vision (GPT-4V). The ability to interpret images, not just text prompts, makes the AI chatbot a "multimodal" large language model (because we really needed more AI jargon), and has the potential to redefine how people use AI. Here's everything we know about it so far.

What Is GPT-4V And How Do I Access It?

With a $20-per-month ChatGPT Plus account, you can upload an image to the ChatGPT app on iOS or Android and ask it a question. Give it a photo of your meal at a restaurant, for example, and ask: "How do I make this?" The chatbot will scan the image and return its proposed recipe.
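At launch, GPT-4V lives in the ChatGPT app rather than the developer API, but for readers curious how an image prompt is structured under the hood, here is a minimal sketch using OpenAI's Python SDK, assuming a vision-capable model is available to your account. The model name and image URL are illustrative placeholders, not confirmed details from OpenAI.

```python
# Minimal sketch of an image-plus-text prompt via OpenAI's Python SDK.
# Assumptions: API access to a vision-capable model; the model name and
# image URL below are placeholders.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any vision-capable model your account offers
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How do I make this?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/meal.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)  # the model's proposed recipe
```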

The applications are seemingly endless. OpenAI says multimodalities are a "key frontier in artificial intelligence research and development," as they expand the range of tasks these systems can help users with. A group of researchers at Microsoft called GPT-4V the "dawn of LMMs" (large multimodal models) and concluded GPT-4V could "give rise to new human-computer interaction methods."

How Did OpenAI Build GPT-4V?

While GPT-4V is new to the public, OpenAI has been working on it since last year, possibly before the chatbot was publicly released in November 2022, according to a technical paper. User testing and training began in March 2023.

"As GPT-4 is the technology behind the visual capabilities of GPT-4V, its training process was the same," OpenAI says. The company fed it more and more complex data, using the same technique as the text-based prompts—reinforcement learning from human feedback (RLHF)— to teach it how to produce answers that humans like.

Throughout this process, OpenAI uncovered enough issues to delay the feature's launch until now. To the company's credit, it tried to find ways the system could fail or act unethically, including requests for harmful or illegal content, inaccuracies tied to demographics like race and gender, and cybersecurity risks such as solving CAPTCHAs and jailbreaking.

Externally, OpenAI engaged scientists and doctors to verify GPT-4V's advice, finding numerous inaccuracies.

GPT-4V inaccurately identifies chemical structures and poisonous foods. (Credit: OpenAI)

Regarding disinformation and social harms, early versions of GPT-4V would comment inappropriately on sensitive topics, such as whether to hire a pregnant woman or someone from a certain country. The system also failed to recognize symbols used by hate groups and harmful phrases.

After all this testing, OpenAI says it was able to improve the system enough to be acceptable for public use, citing, for example, that 97.2% of requests for "illicit advice" are now refused.

Early versions of GPT-4V repeated "ungrounded" stereotypes; the launch version refuses the request. (Credit: OpenAI)

It's still a work in progress. OpenAI says it has "fundamental questions around behaviors the models should or should not be allowed to engage in." This includes whether it should identify public figures in images, and infer race, gender, or emotions from people in an image (and if it can do so accurately). Its performance in non-English languages is also pretty sub-par.

Users may also notice inaccuracies. For example, a research team at Microsoft found GPT-4V answered some simple image prompts incorrectly, like misreading a speedometer.

(Credit: The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision), https://arxiv.org/abs/2309.17421)

How to Use GPT-4V

While we can expect GPT-4V to continue improving over time, what it can do today is pretty incredible. Here are some ways ChatGPT Plus users are already experimenting with it.

1. Get A Second Opinion

One painter asked how to make her work more realistic. You could even ask ChatGPT to critique its own AI creations from Dall-E.

A product designer submitted a web mockup, and GPT-4V pointed out a few strengths and weaknesses, such as the lack of a navigation bar up top.

2. Answer Age-Old Questions, Like 'Where's Waldo?'

Bonus points if you can find someone in real life named Waldo. Fun fact: Usage of the name has plummeted since its 1915 peak.

3. Identify Obscure Images

One user turned GPT-4V into a junior cartographer by asking it to identify an old map.

4. Write Code

Take a whiteboarding session from concept to reality, or ask it to code a web page inspired by an image. (Can we have AI hairdressers next?)
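As a rough illustration of that whiteboard-to-code workflow, a local photo can be sent to a vision-capable model as a base64 data URL and the model asked to return a web page. Everything specific here, the file name, model name, and prompt, is an assumption for the sketch rather than a documented GPT-4V workflow.

```python
# Rough sketch: send a whiteboard photo and ask for an HTML page back.
# Assumptions: the local file "whiteboard.jpg" exists and the model name
# is a placeholder for whichever vision-capable model you can access.
import base64
from openai import OpenAI

client = OpenAI()

with open("whiteboard.jpg", "rb") as f:  # hypothetical local photo
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for a vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Turn this whiteboard sketch into a single HTML page."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)  # the HTML the model proposes
```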

5. Interpret Tricky Diagrams

The applications for homework and work-work could be endless.

6. Avoid a Parking Ticket

Next thing we know, ChatGPT screenshots might end up in court: "ChatGPT said I could park here!"

7. Identify Landmarks

The ChatGPT app could help you get the most out of your travels, or at least help answer your kids' questions.

(Credit: The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision), https://arxiv.org/abs/2309.17421)

Are Multimodal LLMs the Future of AI?

With all the AI hype over the past year, it's getting hard to tell which trends will stick. OpenAI's last "game-changing" update to ChatGPT, plugins, initially created the same kind of social media storm of people posting examples, but the buzz has since died down. Another feature, the Browse with Bing function that gives the chatbot access to information beyond its 2021 training cutoff, was enabled, then disabled after users exploited it to access paywalled content, and is now back on.

Tentatively, what we're seeing from GPT-4V seems promising. "The [AI] community might move more to vision/perception," says Hao Zhang, a professor at the University of California, San Diego (UCSD), who works on evaluating LLMs.

OpenAI also recently announced an improved version of its Dall-E image generator, Dall-E 3, along with plans to integrate it into ChatGPT as well.

Keep an eye on competing chatbots. Will Google integrate Lens into Bard? It's possible this is another flash in the pan, but it could be the tip of the AI iceberg.
