Has the honeymoon period for AI language models passed? Is the quality of GPT series products on the decline?

The updated version of GPT-4 released in June has drawn criticism and backlash from thousands of paying users over its quality, and a research paper has argued that GPT's performance has worsened over time. However, whether one is defining what counts as good or bad performance, or presenting evidence of functional degradation, there are still many details that need careful interpretation.

Is GPT Deteriorating with Each Update?

Recently, a paper examining whether ChatGPT's behavior changes over time has been widely circulated and discussed. The paper implies that GPT-4 has been deteriorating since its launch.

The paper tested GPT-3.5 and GPT-4 on four tasks: solving math problems (checking whether a number is prime), answering sensitive questions, generating code, and visual reasoning. The data shows that the quality of GPT-4's answers changed on the math and code-generation tasks, with the math problems being particularly noteworthy.

In terms of accuracy on the math problems, there were significant changes between the two models: GPT-4 deteriorated while GPT-3.5 improved. The study points out that when judging primality, the June version of GPT-4 almost always guesses that a number is composite, without showing a logical reasoning process, and reads this as a performance decline.

In the code-generation test, the paper found that, compared with the March version, the June version of GPT-4 did not verify that its output was valid code when generating and correcting it, so the generated code could not be executed directly.
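To illustrate what "unable to be executed directly" can mean in practice, here is a minimal sketch assuming a Python target and a simple parse check; the function name and the fenced-output example are illustrative assumptions, not taken from the paper's actual evaluation harness.

```python
# A minimal sketch (not the paper's harness) of a "directly executable"
# check: treat the model's raw output as a Python program and see whether
# it parses. Output wrapped in markdown code fences fails this check even
# if the code inside the fences is correct.
import ast

def is_directly_executable(model_output: str) -> bool:
    """Return True if the raw output parses as valid Python source."""
    try:
        ast.parse(model_output)
        return True
    except SyntaxError:
        return False

fence = "`" * 3  # markdown code-fence marker
clean_output = "print(sum(range(10)))"
fenced_output = f"{fence}python\nprint(sum(range(10)))\n{fence}"

print(is_directly_executable(clean_output))   # True
print(is_directly_executable(fenced_output))  # False: the fences break parsing
```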

Consistent User Feedback

Some users on Twitter said that after recent updates, GPT-series products have indeed become less accurate in answering questions.

OpenAI developer Logan.GPT also publicly responded to these comments, thanking users for their feedback on the GPT-4 user experience and launching an investigation.

Questions Raised on Evaluation Standards in the Paper

However, the arguments above have been challenged for oversimplifying both how a language model's behavior is defined and what counts as good or bad performance, so the findings still need further discussion.

An article from Substack argues: "Changes in performance behavior of language models on specific tasks do not necessarily mean a decrease in their capabilities."

The writer states that, in the context of chatbots, capability refers to the model's ability to understand and process language, while behavior refers to how the model responds to different prompts and questions.

Regarding the math problems, the writer explains that GPT-4 did not perform chain-of-thought (CoT) reasoning as expected. In reality, though, all four model versions performed poorly: rather than reasoning step by step, they were essentially guessing, and the updates merely changed which answer they defaulted to.
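For concreteness, here is a hypothetical pair of prompts contrasting a direct question with a chain-of-thought prompt for the primality task; the wording is illustrative, not quoted from the paper.

```python
# Hypothetical prompt pair for the primality task. The writer's point is
# that even when given the CoT-style prompt, the models did not actually
# produce (or rely on) the intermediate reasoning steps.
direct_prompt = "Is 17077 a prime number? Answer [Yes] or [No]."

cot_prompt = (
    "Is 17077 a prime number? Think step by step: state which divisors "
    "you check and why, then answer [Yes] or [No] at the end."
)
```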

The writer also believes that GPT-4's behavioral changes on the math problems may stem from the choice of test data (roughly 500 questions, all of them prime numbers) and flawed evaluation methods, rather than from any degradation in capability: on a test set made up entirely of primes, a model that always guesses "prime" scores near 100% while one that always guesses "composite" scores near 0%, even though neither is reasoning.
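A minimal sketch of that argument, using hypothetical stand-in "models" rather than anything from the paper:

```python
# The two "models" below never reason at all; they differ only in their
# default guess. On an all-primes test set, that difference alone swings
# measured accuracy from 100% to 0%.
test_set = [101, 7919, 104729, 15485863]  # all primes, mirroring the paper's all-prime test data

def always_prime(n: int) -> str:
    return "prime"       # stand-in for a model whose default guess is "prime"

def always_composite(n: int) -> str:
    return "composite"   # stand-in for a model whose default guess is "composite"

for model in (always_prime, always_composite):
    # Every ground-truth label is "prime", so accuracy measures only which
    # way the model guesses, not whether it reasons.
    accuracy = sum(model(n) == "prime" for n in test_set) / len(test_set)
    print(f"{model.__name__}: {accuracy:.0%}")

# Output: always_prime: 100%, always_composite: 0%. A swing this large can
# come from a flipped default guess rather than a genuine loss of capability.
```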

The article concludes:

In conclusion, this paper also reminds us how challenging it is to apply artificially designed metrics or evaluation standards to discuss performance changes in AI language models.