Ilya Sutskever once was asked about this in an interview and he said that in order to predict the next token, the LLM does actually need to develop some understanding of what the relationship between the tokens is.
What an extremely bizarre use of language that phrase is!
The current LLM technology is
STATIC -- it doesn't "develop" anything, let alone "understand" anything.
An LLM does
encode the relationship between the tokens -- that's its raison d'être! But that is not "understanding" -- it's just "statistical" modelling of the training data.
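As a drastically simplified illustration of what "statistical modelling of token relationships" means (a toy bigram counter -- nothing like a real transformer's learned weights, but the same in spirit: pure counting, zero understanding):

```python
from collections import Counter, defaultdict

# A count table "encodes the relationship between tokens" without any
# understanding whatsoever -- it is nothing but statistics over the corpus.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(token):
    """Return the statistically most likely next token."""
    return bigram_counts[token].most_common(1)[0][0]

print(predict_next("sat"))  # "on" -- the only continuation ever observed
```

Scale the table up to billions of learned parameters and the output gets far more fluent, but the operation is still "most likely continuation given the training data".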
The "understanding" part comes
only in the human mind of the human reader of the output of an LLM/GPT. People impose their own world model on whatever the GPT produces and make their assumptions from there. You apparently believe that LLM/GPT systems can "reason" because you are trying to map how you see them operate onto your world model of how a "mind" might work.
I'll concede that the feedback mechanisms in LLM/GPT systems that automate "chain-of-thought" processing might be considered to be doing some alien form of "thinking", though I would definitely not go so far as to apply the word "reasoning" to such processing.
Anyway, because of the static nature of the backend LLM(s) -- the lack of "continuous learning", as you said -- I wouldn't call anything done by any LLM/GPT system "reasoning", and indeed I don't believe any LLM-based technology can
ever be made to "reason" in the sense I understand the concepts behind that word.
It really doesn't help that the GenerativeAI vendors are all conflating the use of these words in how they describe their systems -- they're using a form of propaganda to make their systems seem more powerful than they really are. Combine that with the ways other GenerativeAI fanatics misunderstand the tools they promote, and it's practically impossible for the average person to realize that they're being sold an impossible dream at the moment.
Deeper review:
Reasoning involves applying first principles/premises in a logical manner to a problem, and possibly adjusting one's world model as one discovers new information during the process. The so-called automated "chain-of-thought" systems only simulate part of that by trying to break down the prompt into steps, reprocessing them along with the context of the initial output, possibly several times around, and possibly against several different LLMs, in order to try to "firm up" the final response in hopes that it is more realistic, accurate, and factual. It's just an attempt to automate what people were doing with manual "chain of thought prompting". However no form of "CoT" use alters the training of the back-end LLMs being used. Some automated systems apparently create new temporary micro-models during the process which may help guide the selection of backend models for each step, or even rewrite some parts of the prompt (tree-of-thought processing, multi-agent processing).
Either way "CoT" does not make an LLM "reason" -- it simply gives more complete "context" to a prompt or series of prompts -- it's a better way of providing a prompt that increases the likelihood of the GPT predicting the next tokens and generating a more understandable and possibly more accurate result for what we humans would consider to be a more complex multi-step problem.
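The point that CoT only enriches the context, never the model, can be sketched in a few lines. (`call_llm` below is a hypothetical stand-in for any stateless completion API -- the names and step count are illustrative, not any vendor's actual interface.)

```python
# Sketch of automated chain-of-thought as *context accumulation*.
# Note what changes across iterations: only the prompt text. The model
# behind call_llm stays exactly as frozen on pass 3 as it was on pass 1.
def call_llm(prompt: str) -> str:
    # Placeholder: a real system would call a frozen backend model here.
    return f"[step derived from {len(prompt)} chars of context]"

def chain_of_thought(question: str, n_steps: int = 3) -> str:
    context = f"Question: {question}\nThink step by step.\n"
    for i in range(n_steps):
        step = call_llm(context)              # same static model every pass
        context += f"Step {i + 1}: {step}\n"  # only the context grows
    return context

trace = chain_of_thought("What is 12 * 17?")
print(trace)
```

Everything "learned" during the loop lives in the prompt string and is discarded when the session ends -- which is precisely why I wouldn't call it reasoning.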
Multi-modal models don't have anything to do with reasoning or understanding -- they just encode multiple or different media types vs. how an LLM encodes plain text. They aren't really "language" models per se, but perhaps attempts to better model other aspects of the real world beyond human languages, and possibly tie in aspects of a language model to those different media inputs.
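In other words, "multi-modal" just means the encoding front-end accepts more input types. A toy sketch (the encoders below are stand-ins with made-up arithmetic, purely to show the shape of the idea):

```python
# Different media are encoded into vectors in one shared sequence;
# nothing about this encoding step involves reasoning or understanding.
def embed_text(tokens):
    # Stand-in text encoder: one number per token.
    return [float(len(t)) for t in tokens]

def embed_image_patches(patches):
    # Stand-in vision encoder: one number per image patch.
    return [float(sum(p)) / len(p) for p in patches]

# Text tokens and image patches end up side by side in the same sequence,
# which the backbone model then processes uniformly.
shared_sequence = embed_text(["a", "cat"]) + embed_image_patches([[0, 255], [128, 128]])
print(shared_sequence)
```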
Reinforcement learning from human feedback is simply an attempt to "flood" the model with extra training material that the model creator believes will cause it to preferentially generate output that humans will see as more in tune with whatever values and ideals the model creator prefers. One might worry that Grok was RLHF trained with Nazi and other extreme-right propaganda, for example.
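Stripped of the machinery, the RLHF idea reduces to this (a hypothetical sketch, not a real training loop -- the scoring function stands in for a learned reward model trained on human preference pairs):

```python
# Minimal sketch of the RLHF idea: human raters induce a reward model,
# and completions it scores higher are reinforced *before* deployment.
completions = ["rude answer", "polite helpful answer"]

def reward_model(text: str) -> float:
    # Stand-in for a learned reward model; real ones are neural networks
    # fitted to human A-vs-B preference judgements.
    return 1.0 if "polite" in text else 0.0

# During training the policy is nudged toward high-reward outputs;
# after training the weights are frozen and the feedback loop stops.
preferred = max(completions, key=reward_model)
print(preferred)
```

Note the word "before": all of this shaping happens prior to release, which is the author's point about nothing here being dynamic.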
Reinforcement learning for reasoning is simply an attempt by the model creator to "flood" the model with material they think will be helpful in emphasizing what humans might consider methods for logical thinking and reasoning. E.g., when a CoT prompt shows an example of breaking a mathematical formula down into the steps necessary to solve it, such training makes it more likely that the output will follow a similar pattern and thus produce a correct result. It's not the same as using, say, WolframAlpha to properly parse the formula and calculate the result, though such training might make it more likely that an automated CoT system would choose to use WolframAlpha as an external agent to get the correct results.
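One common variant of this scores a completion purely on whether its final answer matches a checked ground truth. A sketch (the "Answer:" convention and helper names are assumptions for illustration, not any particular system's format):

```python
# Sketch of reward-by-verification for reasoning-style training data:
# reward is 1 if the completion's final answer matches the known ground
# truth, 0 otherwise -- again applied during training, not at inference.
def extract_final_answer(completion: str) -> str:
    # Assumed convention: the final answer follows the last "Answer:".
    return completion.rsplit("Answer:", 1)[-1].strip()

def reward(completion: str, ground_truth: str) -> int:
    return int(extract_final_answer(completion) == ground_truth)

good = "12 * 17 = 12*10 + 12*7 = 120 + 84. Answer: 204"
bad = "12 * 17 is roughly 200. Answer: 200"
print(reward(good, "204"), reward(bad, "204"))  # prints: 1 0
```

The reward only checks the final token string -- it never inspects whether the intermediate "steps" were logically valid, which is another reason I balk at calling the result "reasoning".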
However none of these reinforcement learning techniques are dynamic. They all have to be done before the final model is used.