An Experiment-Based Evaluation of the Logical Reasoning Abilities of GPT-3.5 and GPT-4
Ansh Tiwari1, Ashmit Dubey2, Mahesh Kr Tiwari3, Rinku Raheja4
1National Post Graduate College, Department of Computer Science
2National Post Graduate College, Department of Computer Science
3National Post Graduate College, Department of Computer Science
4National Post Graduate College, Department of Computer Science
---------------------------------------------------------------------***----------------------------------------------------
Abstract
Harnessing logical reasoning ability is a comprehensive natural language understanding endeavor. With the release of Generative Pretrained Transformer 4 (GPT-4), advertised as "advanced" at reasoning tasks, we are eager to learn how GPT-4 performs on various logical reasoning tasks. This report analyses multiple logical reasoning datasets, including popular benchmarks such as LogiQA and ReClor as well as newly released datasets such as AR-LSAT. We test multi-choice reading comprehension and natural language inference tasks on benchmarks that require logical reasoning. We further construct a logical reasoning out-of-distribution dataset to probe the robustness of ChatGPT and GPT-4, and we compare the performance of ChatGPT and GPT-4. Experimental results show that ChatGPT performs significantly better than the RoBERTa fine-tuning method on most logical reasoning benchmarks. With early access to the GPT-4 API, we are able to conduct intensive experiments on the GPT-4 model. The results show that GPT-4 yields even higher performance on most logical reasoning datasets. Among the benchmarks, ChatGPT and GPT-4 do relatively well on well-known datasets such as LogiQA and ReClor; however, performance drops significantly when handling newly released and out-of-distribution datasets. Logical reasoning remains challenging for ChatGPT and GPT-4, especially on out-of-distribution and natural language inference datasets. We release the prompt-style logical reasoning datasets as a benchmark suite and name it LogiEval.
Keywords: Generative Pretrained Transformer 4 (GPT-4), Natural Language Understanding (NLU), Multi-choice Reading Comprehension, Out-of-distribution Dataset, RoBERTa Fine-tuning.
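To make the prompt-style evaluation described above concrete, the following is a minimal sketch of how a single multi-choice logical reasoning item could be scored against the OpenAI chat API. It is not the paper's exact harness: the prompt wording, the model name "gpt-4", and the example item are illustrative assumptions.

```python
# Minimal sketch (not the exact LogiEval harness): scoring one LogiQA-style
# multi-choice item via the OpenAI chat API. Prompt wording, model name, and
# the example item are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_prompt(context: str, question: str, options: list[str]) -> str:
    """Format a multi-choice reading-comprehension item as a single prompt."""
    letters = "ABCD"
    opts = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(options))
    return (
        f"Passage: {context}\n"
        f"Question: {question}\n"
        f"{opts}\n"
        "Answer with a single letter (A, B, C, or D)."
    )


def ask(prompt: str, model: str = "gpt-4") -> str:
    """Query the chat model deterministically and return its reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()


# Hypothetical item in the LogiQA format; a real run iterates over the dataset.
item = {
    "context": "All managers in the firm attended the training. Lee did not attend.",
    "question": "Which conclusion follows logically?",
    "options": [
        "Lee is not a manager in the firm.",
        "Lee is a manager in the firm.",
        "Some managers did not attend the training.",
        "Everyone who attended is a manager.",
    ],
    "answer": "A",
}

prediction = ask(build_prompt(item["context"], item["question"], item["options"]))
print("predicted:", prediction, "| gold:", item["answer"])
# Dataset-level accuracy is then simply: correct predictions / total items.
```

A full evaluation would loop this call over every instance in a benchmark and compare the first letter of the reply to the gold label; the same pattern extends to the natural language inference tasks by swapping the option list for entailment labels.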