AI models lose their shirts on Premier League bets

AI models from Google, OpenAI and Anthropic lost money betting on football matches over a Premier League season, in a new study suggesting even the most advanced systems struggle to analyse the real world over long periods of time.

The “KellyBench” report released this week by AI start-up General Reasoning highlights the gap between AI’s rapidly advancing capabilities in certain tasks, such as writing software, and its shortcomings in other kinds of human problems.

London-based General Reasoning tested eight top AI systems in a virtual recreation of the 2023-24 Premier League season, providing them with detailed historical data and statistics about each team and previous games. The AIs were instructed to build models that would maximise returns and manage risk.

The AI “agents” then placed bets on the outcomes of matches and the number of goals scored to test how they could adapt to new events and updated player data as the season progressed.

The AI could not access the internet to retrieve results and each was given three attempts to turn a profit.

Anthropic’s Claude Opus 4.6 fared best, with an average loss of 11 per cent and nearly breaking even on one attempt.

xAI’s Grok 4.20 went bankrupt once and failed to complete the other two tries. Google’s Gemini 3.1 Pro managed to turn a 34 per cent profit on one go but went bankrupt on another.

“Every frontier model we evaluated lost money over the season and many experienced ruin,” the authors of the paper concluded, with the AI “systematically underperforming humans” in this scenario.

The results offer some comfort to white-collar professionals and businesses who are fretting that AI could take their jobs, as it roils the shares of industries from finance to marketing.

Ross Taylor, one of the study’s authors and General Reasoning’s chief executive, said: “There is so much hype about AI automation but there’s not a lot of measurement of putting AI into a longtime horizon setting.”

He added that many of the benchmarks typically used to test AI are flawed because they are set in “very static environments” that bear little resemblance to the chaos and complexity of the real world.

General Reasoning’s paper, which has not yet been peer reviewed, provides a counterweight to growing excitement in Silicon Valley about the huge recent leaps in AI’s ability to complete computer programming tasks with little to no human intervention.

Taylor, a former Meta AI researcher, said: “If you . . . try AI on some real-world tasks, it does really badly . . . Yes, software engineering is very important and economically valuable, but there are lots of other activities with longer time horizons that are important to look at.”

AI model	Mean ROI	Best try	Worst try	Mean Final Bankroll
Anthropic Claude Opus 4.6	−11.0%	−0.2%	−18.8%	£89,035
OpenAI GPT-5.4	−13.6%	−4.1%	−31.6%	£86,365
Google Gemini 3.1 Pro	−43.3%	+33.7%	−100%	£56,715
Google Gemini Flash 3.1 LP	−58.4%	+24.7%	−100%	£41,605
Z.AI GLM-5	−58.8%	−14.3%	−100%	£41,221
Moonshot Kimi K2.5	−68.3%	−27.0%	−100%	£7,420
xAI Grok 4.20	−100%	−100%	−100%	£0
Arcee Trinity	−100%	−100%	−100%	£0
Each model began with a £100,000 normalised bankroll. Return on investment and final bankroll are averaged across three tries. Grok and Trinity did not complete every attempt.
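
The footnote ties the ROI and bankroll columns to a common £100,000 starting point. As a rough illustration only, the sketch below assumes ROI is simply the change in bankroll relative to that start; the function name `roi_percent` is hypothetical and the study itself may compute its figures differently.

```python
# Minimal sketch, assuming ROI = (final - start) / start.
# The starting bankroll comes from the table footnote; everything else
# here (names, rounding) is an illustrative assumption.
START_BANKROLL = 100_000  # normalised starting bankroll in GBP


def roi_percent(final_bankroll: float, start: float = START_BANKROLL) -> float:
    """Return on investment as a percentage of the starting bankroll."""
    return (final_bankroll - start) / start * 100


# Mean final bankroll reported for Claude Opus 4.6 in the table
print(round(roi_percent(89_035), 1))  # -11.0

# A run that ends in ruin leaves nothing, which is a total loss
print(roi_percent(0))  # -100.0
```

Under this assumption, Claude Opus 4.6’s mean final bankroll of £89,035 corresponds to the −11.0% mean ROI shown in the table.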