FutureBench: A benchmark for AI agents predicting future events via news-generated questions and Polymarket data
AI Impact Summary
FutureBench is a forecasting benchmark for AI agents. Its time-bound tasks come from two sources: News-Generated Questions, produced by a smolagents-based agent that reads front-page articles, and Polymarket predictions. The evaluation isolates the effect of three factors on predictive reasoning: frameworks (e.g., LangChain vs CrewAI), search tools (e.g., Tavily vs other engines), and models (e.g., DeepSeek-V3 vs GPT-4). Because each task is tied to a deadline, outcomes are verifiable and time-stamped. For technical teams, adopting FutureBench means building robust data pipelines and governance around live data sources so that forecast metrics are reproducible and can inform tooling and model choices.
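To make "verifiable and time-stamped" outcomes concrete, here is a minimal Python sketch of scoring forecasts once their questions have resolved. It is an illustration under stated assumptions, not FutureBench's actual scoring code: the `Forecast` record, the `score_forecasts` helper, and the choice of accuracy and Brier score as metrics are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Forecast:
    """Hypothetical record for one agent prediction on a yes/no question."""
    question_id: str
    probability_yes: float           # agent's stated probability of "yes"
    resolves_on: date                # deadline by which the question resolves
    outcome: Optional[bool] = None   # None until the real-world outcome is known


def score_forecasts(forecasts: list[Forecast], as_of: date) -> dict:
    """Score only forecasts whose outcome is known by `as_of`.

    Accuracy treats probability >= 0.5 as a "yes" prediction.
    The Brier score is the mean squared error between the stated
    probability and the 0/1 outcome (lower is better).
    """
    resolved = [
        f for f in forecasts
        if f.outcome is not None and f.resolves_on <= as_of
    ]
    if not resolved:
        return {"n": 0, "accuracy": None, "brier": None}

    correct = sum((f.probability_yes >= 0.5) == f.outcome for f in resolved)
    brier = sum(
        (f.probability_yes - float(f.outcome)) ** 2 for f in resolved
    ) / len(resolved)
    return {
        "n": len(resolved),
        "accuracy": correct / len(resolved),
        "brier": brier,
    }


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    sample = [
        Forecast("q1", 0.8, date(2025, 6, 1), outcome=True),
        Forecast("q2", 0.3, date(2025, 6, 15), outcome=True),
        Forecast("q3", 0.6, date(2025, 12, 1)),  # not yet resolved, skipped
    ]
    print(score_forecasts(sample, as_of=date(2025, 7, 1)))
```

Filtering on the resolution date keeps unresolved questions out of the metric, so two runs scored against the same `as_of` date are directly comparable across frameworks, tools, or models.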
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info