This study investigated how large language models (LLMs) such as Claude 3 and GPT-4 perform on regression tasks given only in-context examples, without any additional training. It showed that LLMs can carry out both linear and non-linear regression, matching or even outperforming supervised learning methods. The study also examined the impact of dataset familiarity and potential data contamination, finding that explicit knowledge of the dataset name did not significantly affect the LLMs’ performance. Additional investigations covered the performance of non-transformer models and the effectiveness of LLMs on non-numerical regression tasks. The results suggest that LLMs are versatile tools capable of understanding and applying mathematical concepts learned during pre-training.
Main Points
LLMs are capable of regression tasks using in-context examples.
LLMs can perform both linear and non-linear regression without being specifically trained for it, rivaling traditional supervised methods like Linear Regression and Gradient Boosting.
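The in-context setup described here can be sketched as follows: regression examples are serialized into a text prompt for the LLM, and a classical supervised baseline is fit on the same data for comparison. The prompt template below is an illustrative assumption, not the paper's exact wording, and the baseline uses ordinary least squares in place of a full Gradient Boosting comparison.

```python
import numpy as np

def make_regression_prompt(xs, ys, query_x):
    """Serialize (x, y) pairs as in-context examples followed by a query.
    The 'Feature 0 / Output' template is an illustrative assumption."""
    lines = [f"Feature 0: {x:.2f}\nOutput: {y:.2f}" for x, y in zip(xs, ys)]
    lines.append(f"Feature 0: {query_x:.2f}\nOutput:")
    return "\n".join(lines)

# Synthetic linear data: y = 3x + 1 (noiseless, for illustration).
rng = np.random.default_rng(0)
xs = rng.uniform(-5, 5, size=20)
ys = 3 * xs + 1

# This prompt would be sent to the LLM, which completes the final "Output:".
prompt = make_regression_prompt(xs, ys, query_x=2.0)

# Supervised baseline on the same examples: ordinary least squares.
A = np.stack([xs, np.ones_like(xs)], axis=1)
slope, intercept = np.linalg.lstsq(A, ys, rcond=None)[0]
baseline_pred = slope * 2.0 + intercept  # recovers y = 3*2 + 1 = 7 exactly here
```

Comparing the LLM's completion against `baseline_pred` on held-out queries is the kind of head-to-head evaluation the study reports.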
Performance of LLMs improves with more in-context examples.
In-context examples boost LLM performance, with highly capable models like Claude 3 and GPT-4 approaching near-optimal decision quality as more examples are provided.
Data contamination concerns are addressed by showing unchanged performance whether or not LLMs 'know' the dataset.
An experiment comparing models’ performance with and without knowledge of the dataset name showed no significant difference, addressing concerns about data contamination.
Insights
The performance of LLMs on various regression tasks improved with the increase in the number of in-context examples.
As the number of in-context exemplars increased, the performance of models such as GPT-4 and Claude 3 continued to show improvement, indicating their ability to leverage more data effectively.
Non-transformer LLMs like RWKV and StripedHyena also demonstrate the ability to perform regression tasks, albeit generally not as effectively as transformer-based models.
While RWKV outperformed unsupervised baselines, it was generally outperformed by transformer-based LLMs, and StripedHyena struggled, producing invalid outputs in some scenarios.
Explicit knowledge of the dataset name doesn't significantly affect LLM performance, addressing potential data contamination concerns.
A comparative analysis showed that model performance does not change significantly whether or not the dataset name is revealed during evaluation, suggesting that data contamination is unlikely to be driving the results.
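The ablation described above can be sketched as two prompt variants that differ only in whether the dataset name is disclosed. The template and the dataset name `friedman1` are illustrative assumptions, not the authors' exact wording; the point is that everything except the name is held fixed.

```python
def build_prompt(examples, query, dataset_name=None):
    """Build an evaluation prompt with or without revealing the dataset
    name, mirroring the contamination ablation. Template is assumed."""
    header = (
        f"The following examples come from the dataset '{dataset_name}'.\n"
        if dataset_name
        else ""
    )
    body = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{header}{body}\nInput: {query}\nOutput:"

examples = [(1, 4), (2, 7), (3, 10)]
# Identical task, with and without the dataset name disclosed.
named = build_prompt(examples, 4, dataset_name="friedman1")
anonymous = build_prompt(examples, 4)
```

Running the same model on both variants and comparing prediction errors tests whether recognizing the dataset by name (a contamination signal) changes behavior.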