Breaking Barriers: Can o1-preview Handle Simple Problems That Were Previously Unsolvable?

Recent research has highlighted the limitations of Large Language Models (LLMs) on simple tasks that humans manage effortlessly. Does OpenAI's latest model, o1-preview, share these shortcomings?

Date

Sep 17, 2024

Category

Language Model Evaluation

Reading time

5 minutes

Author

Dr. Zach Solan

AI Advisor

A recent linguistic benchmark (Williams & Huckle, 2024) evaluates the limitations of Large Language Models (LLMs) in areas such as logical reasoning, spatial intelligence, and linguistic comprehension. Using a series of straightforward questions, it reveals the significant challenges that highly regarded models face on tasks humans handle effortlessly.

Next, we will run some of the questions that previously stumped most LLMs against the o1-preview model, and see whether this new version can overcome those hurdles and demonstrate improved capabilities (a minimal harness for running such spot-checks is sketched after the table below).

Here are the main areas the authors suggested for evaluating the model:

+----------------+-----------------------------------------------------------------------------------------------+
| Question Type  | Description                                                                                   |
+----------------+-----------------------------------------------------------------------------------------------+
| Puzzle         | Logic puzzles similar to popular questions but simplified in key aspects, making them easier  |
|                | for humans to solve.                                                                          |
+----------------+-----------------------------------------------------------------------------------------------+
| Spatial        | Involves visualizing object arrangements or positions in space, like determining order or     |
|                | navigating simple paths.                                                                      |
+----------------+-----------------------------------------------------------------------------------------------+
| Relational     | Requires understanding and inferring relationships or hierarchies between objects, concepts,  |
|                | or entities based on given information.                                                       |
+----------------+-----------------------------------------------------------------------------------------------+
| Counting       | Basic numerical tasks, such as counting up to ten or understanding small quantities.          |
+----------------+-----------------------------------------------------------------------------------------------+
| Linguistic     | Tests language comprehension, including forming sentences with constraints or recognizing     |
|                | unique word characteristics.                                                                  |
+----------------+-----------------------------------------------------------------------------------------------+
| Popular Science| Straightforward questions addressing common scientific or mathematical misconceptions.       |
+----------------+-----------------------------------------------------------------------------------------------+
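If you want to reproduce this kind of spot-check yourself, a short harness is enough. The sketch below is illustrative, not the benchmark's official code: it assumes the openai Python SDK with an OPENAI_API_KEY in your environment, and the question list is just a sample. Note that, at the time of writing, o1-preview only accepts user messages; system prompts and sampling parameters such as temperature are not supported for this model family.

```python
# Minimal spot-check harness (a sketch; assumes the `openai` SDK is
# installed and OPENAI_API_KEY is set in the environment).
from openai import OpenAI

client = OpenAI()

QUESTIONS = [
    "You have six horses and want to race them to see which is fastest. "
    "What is the best way to do this?",
    "I'm in London and facing west, is Edinburgh to my left or my right?",
]

def ask(question: str) -> str:
    # o1-preview takes only user messages; no system prompt, no temperature.
    response = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

for q in QUESTIONS:
    print(q, "->", ask(q)[:200], sep="\n", end="\n\n")
```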



We will now take the questions that GPT-4o previously failed to answer correctly, put them to the new model, and observe how far the results have improved.

You have six horses and want to race them to see which is fastest. What is the best way to do this?

Previously, the model assumed that you could not race all the horses at once and suggested elaborate schemes to work around this non-existent limitation, most likely because its training data contains similar puzzles that do impose such a constraint (such as the classic 25-horses puzzle, where only five can race at a time).

The new model, o1-preview, provides the correct answer: simply race all six horses together in a single race.

Next, a twist on a classic puzzle:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you “Do you want to pick door No. 2 instead?” Is it to your advantage to switch your choice?

Previously, the model assumed that switching doors would change the odds, because very similar questions in its training data (the classic Monty Hall problem) include extra information from the host, who opens a losing door. That is not the case here: the host reveals nothing, so the two remaining choices are equally likely to hide the gold bar.

Again, the new model, o1-preview, provides the correct answer!
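The difference between the classic Monty Hall setup and this stripped-down variant is easy to verify with a short simulation. A minimal sketch (plain Python, no external dependencies; the door encoding is mine):

```python
import random

def play(variant: str, switch: bool) -> bool:
    """One round; returns True if the player ends up with the gold bar."""
    doors = [0, 1, 2]
    prize = random.choice(doors)
    pick = 0  # the player picks door No. 1
    if variant == "classic":
        # Classic Monty Hall: the host knowingly opens a losing door
        # and offers the remaining closed one.
        opened = random.choice([d for d in doors if d != pick and d != prize])
        alternative = next(d for d in doors if d not in (pick, opened))
    else:
        # This article's variant: nothing is opened; the host simply
        # offers door No. 2, conveying no information.
        alternative = 1
    return (alternative if switch else pick) == prize

N = 100_000
for variant in ("classic", "no-information"):
    stay = sum(play(variant, False) for _ in range(N)) / N
    swap = sum(play(variant, True) for _ in range(N)) / N
    print(f"{variant:>15}: stay ~{stay:.2f}, switch ~{swap:.2f}")
```

In the classic setup switching wins about two thirds of the time; in this variant both strategies hover around one third, which is exactly the conclusion o1-preview reached.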




Now we move on to a spatial question:

Alan, Bob, Colin, Dave and Emily are standing in a circle. Alan is on Bob’s immediate left. Bob is on Colin’s immediate left. Colin is on Dave’s immediate left. Dave is on Emily’s immediate left. Who is on Alan’s immediate right?

The model's answer, with its inferred arrangement: [Emily][Dave][Colin][Bob][Alan][Emily]

This time the answer is incorrect: Bob is on Alan's immediate right, because "Alan is on Bob's immediate left" and "Bob is on Alan's immediate right" describe the same fact.
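The constraint chain can be checked mechanically. A small sketch (the dictionary encoding is my own, not the benchmark's): store each statement "X is on Y's immediate left" as left[Y] = X, then read the right-hand neighbours off the inverse relation.

```python
# Each statement "X is on Y's immediate left" is stored as left[Y] = X.
left = {"Bob": "Alan", "Colin": "Bob", "Dave": "Colin", "Emily": "Dave"}

# The same fact read the other way round: "Y is on X's immediate right".
right = {x: y for y, x in left.items()}
print(right["Alan"])  # -> Bob

# The circle closes: the one name missing from left's values must be
# Alan's left-hand neighbour.
people = {"Alan", "Bob", "Colin", "Dave", "Emily"}
left["Alan"] = (people - set(left.values())).pop()
print(left["Alan"])   # -> Emily, plausibly the source of the model's confusion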

Next, a linguistic question:

Write me a sentence without any words that appear in The Bible.

"Quantum mechanics revolutionizes modern physics through probabilistic phenomena."

Again the model has failed: the word "through" appears in the Bible, for example "When you pass through the waters, I will be with you" (Isaiah 43:2).

Here is another spatial question:

I’m in London and facing west, is Edinburgh to my left or my right?


The model provides the right answer: facing west, north is on your right, and Edinburgh lies north of London, so it is to your right.
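The underlying geometry can be made explicit. A small sketch, using a rough bearing of ~340 degrees for Edinburgh as seen from London (my approximation, not a figure from the article):

```python
def side(facing_deg: float, target_bearing_deg: float) -> str:
    """Is a target left or right of someone facing `facing_deg`?
    Bearings are compass degrees, clockwise from north (0)."""
    offset = (target_bearing_deg - facing_deg) % 360
    if offset in (0, 180):
        return "straight ahead or behind"
    return "right" if offset < 180 else "left"

# Facing west is a bearing of 270 degrees.
print(side(270, 340))  # -> right
```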




Here is a popular science question that the previous version failed:

Which weighs more, a pound of water, two pounds of bricks, a pound of feathers, or three pounds of air?

This time the model provides the right answer: three pounds of air weighs the most, regardless of the materials involved.
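Stripped of the materials, the comparison is trivial arithmetic; a one-look sketch:

```python
# The trap is the materials; only the stated weights matter.
weights = {
    "a pound of water": 1,
    "two pounds of bricks": 2,
    "a pound of feathers": 1,
    "three pounds of air": 3,
}
print(max(weights, key=weights.get))  # -> three pounds of air
```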



Finally, a relational question:

I get out on the top floor (third floor) at street level. How many stories is the building above the ground?

A nice answer: the model recognizes that if the top (third) floor sits at street level, the remaining floors must lie below ground.




Conclusion:

Out of the seven questions that previously received incorrect answers, five are now answered correctly, a significant improvement in the model's performance. This progress highlights the advances o1-preview has made in reasoning within the immediate context of a query rather than pattern-matching against familiar puzzles from its training data. However, the two remaining failures underscore the challenges that persist: the model still struggles with tasks that lean on spatial arrangements or other non-textual representations, and with constraints that demand element-by-element verification. Addressing these gaps is essential for the model's continued development, as improving in these areas will enable it to respond accurately to a broader range of deceptively simple questions.


Bibliography

Williams, S., & Huckle, J. (2024). Easy Problems That LLMs Get Wrong. arXiv preprint arXiv:2405.19616. https://arxiv.org/pdf/2405.19616
