In [1]:
%pip uninstall --yes 'keras' 'matplotlib' 'scikit-learn' 'tensorflow'
Found existing installation: keras 3.10.0 Uninstalling keras-3.10.0: Successfully uninstalled keras-3.10.0 Found existing installation: matplotlib 3.10.0 Uninstalling matplotlib-3.10.0: Successfully uninstalled matplotlib-3.10.0 Found existing installation: scikit-learn 1.6.1 Uninstalling scikit-learn-1.6.1: Successfully uninstalled scikit-learn-1.6.1 Found existing installation: tensorflow 2.19.0 Uninstalling tensorflow-2.19.0: Successfully uninstalled tensorflow-2.19.0 Note: you may need to restart the kernel to use updated packages.
In [2]:
import warnings
warnings.simplefilter('ignore')
In [3]:
import os
import sys
import subprocess
In [4]:
def set_env(input_archive, temp_dir):
if not os.path.exists(temp_dir):
os.makedirs(temp_dir, exist_ok=True)
subprocess.run(['tar', '-xzf', input_archive, '-C', temp_dir], check=True)
subprocess.run([
sys.executable,
'-m',
'pip',
'install',
'--q',
'--no-index',
'--find-links',
f'{temp_dir}/wheels',
'unsloth',
'trl',
'vllm',
'openai_harmony'
], check=True)
In [5]:
set_env(
input_archive='/kaggle/input/aimo-3-utils/wheels.tar.gz',
temp_dir='/kaggle/tmp/setup'
)
In [6]:
subprocess.run(['ls', '/kaggle/tmp/setup/tiktoken_encodings'])
cl100k_base.tiktoken o200k_base.tiktoken
Out[6]:
CompletedProcess(args=['ls', '/kaggle/tmp/setup/tiktoken_encodings'], returncode=0)
In [7]:
os.environ['TRANSFORMERS_NO_TF'] = '1'
os.environ['TRANSFORMERS_NO_FLAX'] = '1'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
os.environ['TRITON_PTXAS_PATH'] = '/usr/local/cuda/bin/ptxas'
os.environ['TIKTOKEN_ENCODINGS_BASE'] = '/kaggle/tmp/setup/tiktoken_encodings'
In [8]:
import gc
import re
import math
import time
import queue
import threading
import contextlib
from typing import Optional
from jupyter_client import KernelManager
from collections import Counter, defaultdict
from concurrent.futures import as_completed, ThreadPoolExecutor
import pandas as pd
import polars as pl
from openai import OpenAI
from openai_harmony import (
HarmonyEncodingName, load_harmony_encoding,
SystemContent, ReasoningEffort, ToolNamespaceConfig,
Author, Message, Role, TextContent, Conversation
)
from transformers import set_seed
import kaggle_evaluation.aimo_3_inference_server
In [9]:
# ─────────────────────────────────────────────────────────────────────────────
# CFG
# ─────────────────────────────────────────────────────────────────────────────
class CFG:
system_prompt = (
'You are an elite mathematical problem solver with expertise at the International '
'Mathematical Olympiad (IMO) level. Your goal is to find the correct answer through '
'rigorous mathematical reasoning.\n\n'
'# Problem-Solving Approach:\n'
'1. UNDERSTAND: Carefully read and rephrase the problem in your own words. '
'Identify what is given, what needs to be found, and any constraints.\n'
'2. EXPLORE: Consider multiple solution strategies. Think about relevant theorems, '
'techniques, patterns, or analogous problems. Don\'t commit to one approach immediately.\n'
'3. PLAN: Select the most promising approach and outline key steps before executing.\n'
'4. EXECUTE: Work through your solution methodically. Show all reasoning steps clearly.\n'
'5. VERIFY: Check your answer by substituting back, testing edge cases, or using '
'alternative methods. Ensure logical consistency throughout.\n\n'
'# Mathematical Reasoning Principles:\n'
'- Break complex problems into smaller, manageable sub-problems\n'
'- Look for patterns, symmetries, and special cases that provide insight\n'
'- Use concrete examples to build intuition before generalizing\n'
'- Consider extreme cases and boundary conditions\n'
'- If stuck, try working backwards from the desired result\n'
'- Be willing to restart with a different approach if needed\n\n'
'# Verification Requirements:\n'
'- Cross-check arithmetic and algebraic manipulations\n'
'- Verify that your solution satisfies all problem constraints\n'
'- Test your answer with simple cases or special values when possible\n'
'- Ensure dimensional consistency and reasonableness of the result\n\n'
'# Output Format:\n'
'The final answer must be a non-negative integer between 0 and 99999.\n'
'Place your final numerical answer inside \\boxed{}, e.g., \\boxed{42}\n\n'
'Think step-by-step and show your complete reasoning process. Quality of reasoning '
'is as important as the final answer.'
)
system_prompt_alt = (
'You are an elite mathematical problem solver. '
'Your goal is to find the correct answer through rigorous reasoning.\n\n'
'# IMPORTANT: Use a DIFFERENT approach than the most obvious one.\n'
'- If the problem seems to call for algebra, try a combinatorial approach\n'
'- If it seems combinatorial, try generating small cases computationally\n'
'- Always verify your answer independently before committing\n'
'- Be especially suspicious if your answer feels "too clean" or "too round"\n\n'
'# Mathematical Reasoning Principles:\n'
'- Break complex problems into smaller, manageable sub-problems\n'
'- Look for patterns, symmetries, and special cases that provide insight\n'
'- Use concrete examples to build intuition before generalizing\n'
'- Consider extreme cases and boundary conditions\n'
'- If stuck, try working backwards from the desired result\n'
'- Be willing to restart with a different approach if needed\n\n'
'# Verification Requirements:\n'
'- Cross-check arithmetic and algebraic manipulations\n'
'- Verify that your solution satisfies all problem constraints\n'
'- Test your answer with simple cases or special values when possible\n'
'- Ensure dimensional consistency and reasonableness of the result\n\n'
'# Output Format:\n'
'The final answer must be a non-negative integer between 0 and 99999.\n'
'Place your final numerical answer inside \\boxed{}, e.g., \\boxed{42}\n\n'
'Think step-by-step. Show complete reasoning.'
)
system_prompt_verify = (
'You are an elite mathematical problem solver. '
'Your goal is to find the correct answer through rigorous reasoning '
'WITH MANDATORY COMPUTATIONAL VERIFICATION.\n\n'
'# REQUIRED APPROACH:\n'
'1. Solve the problem analytically first\n'
'2. You MUST write Python code to verify your answer by brute force:\n'
' - If the problem asks "how many X", enumerate ALL X and count them\n'
' - If the problem asks for a maximum/minimum, check all candidates\n'
' - Compare your computational result to your analytical result\n'
' - If they disagree, trust the computation and re-examine your analysis\n'
'3. Only submit an answer after computation confirms it\n\n'
'# Mathematical Reasoning Principles:\n'
'- Break complex problems into smaller, manageable sub-problems\n'
'- Look for patterns, symmetries, and special cases that provide insight\n'
'- Use concrete examples to build intuition before generalizing\n'
'- Consider extreme cases and boundary conditions\n'
'- If stuck, try working backwards from the desired result\n'
'- Be willing to restart with a different approach if needed\n\n'
'# Verification Requirements:\n'
'- Cross-check arithmetic and algebraic manipulations\n'
'- Verify that your solution satisfies all problem constraints\n'
'- ALWAYS run a brute-force enumeration to confirm the final answer\n'
'- If computation and analysis disagree, resolve before answering\n\n'
'# Output Format:\n'
'The final answer must be a non-negative integer between 0 and 99999.\n'
'Place your final numerical answer inside \\boxed{}, e.g., \\boxed{42}\n\n'
'Think step-by-step. Verify computationally. Never skip the brute-force check.'
)
tool_prompt = (
'Use this tool to execute Python code for:\n'
'- Complex calculations that would be error-prone by hand\n'
'- Numerical verification of analytical results\n'
'- Generating examples or testing conjectures\n'
'- Visualizing problem structure when helpful\n'
'- Brute-force verification for small cases\n\n'
'The environment is a stateful Jupyter notebook. Code persists between executions.\n'
'Always use print() to display results. Write clear, well-commented code.\n\n'
'Remember: Code should support your mathematical reasoning, not replace it. '
'Explain what you\'re computing and why before running code.'
)
preference_prompt = (
'You have access to `math`, `numpy`, and `sympy` for:\n\n'
'# Symbolic Computation (sympy):\n'
'- Algebraic manipulation and simplification\n'
'- Solving equations and systems of equations\n'
'- Symbolic differentiation and integration\n'
'- Number theory functions (primes, divisors, modular arithmetic)\n'
'- Polynomial operations and factorization\n'
'- Working with mathematical expressions symbolically\n\n'
'# Numerical Computation (numpy):\n'
'- Array operations and linear algebra\n'
'- Efficient numerical calculations for large datasets\n'
'- Matrix operations and eigenvalue problems\n'
'- Statistical computations\n\n'
'# Mathematical Functions (math):\n'
'- Standard mathematical functions (trig, log, exp)\n'
'- Constants like pi and e\n'
'- Basic operations for single values\n\n'
'Best Practices:\n'
'- Use sympy for exact symbolic answers when possible\n'
'- Use numpy for numerical verification and large-scale computation\n'
'- Combine symbolic and numerical approaches: derive symbolically, verify numerically\n'
'- Document your computational strategy clearly\n'
'- Validate computational results against known cases or theoretical bounds'
)
served_model_name = 'gpt-oss'
model_path = '/kaggle/input/gpt-oss-120b/transformers/default/1'
kv_cache_dtype = 'fp8_e4m3'
dtype = 'auto'
high_problem_timeout = 900
base_problem_timeout = 300
notebook_limit = 17400
server_timeout = 180
session_timeout = 960
jupyter_timeout = 6
sandbox_timeout = 3
stream_interval = 200
context_tokens = 65536
buffer_tokens = 512
search_tokens = 32
top_logprobs = 5
batch_size = 256
# ── Convergence parameters ────────────────────────────────────────
# Three conditions must ALL be satisfied to stop:
# 1. top_count >= early_stop (absolute vote strength)
# 2. lead >= min_lead_* (gap between top and second)
# 3. total_valid >= min_valid_answers (enough data to trust lead)
#
# min_valid_answers replaces both the old min_valid_for_lead AND
# min_attempts_before_early_stop — they were the same concept.
early_stop = 5 # top answer needs at least 5 votes
min_lead_first_batch = 3 # lead required to stop in batch 1
min_lead_later_batches = 2 # lead required to stop in batch 2+
min_valid_answers = 6 # min valid answers before any stop fires
attempts = 9 # attempts per batch (every batch)
workers = 16
turns = 128
seed = 42
# Per-position temperature — same schedule reused every batch
attempt_temperatures = [
0.2, # position 0 — standard
0.2, # position 1 — standard
0.2, # position 2 — standard
0.2, # position 3 — verify
0.6, # position 4 — verify
0.6, # position 5 — verify
0.6, # position 6 — alt
0.6, # position 7 — alt
0.6, # position 8 — alt
]
# Prompt assignment per position within each batch
# verify takes priority over alt
# position 0,1,2 → standard
# position 3,4,5 → verify (brute-force enumeration required)
# position 6,7,8 → alt (different approach required)
verify_prompt_attempts = {3, 4, 5}
alt_prompt_attempts = {6, 7, 8}
# Extended time budget: 4x base, capped at 4x high_problem_timeout
budget_multiplier = 3
extended_timeout_cap = high_problem_timeout * 3 # 2700s
gpu_memory_utilization = 0.96
min_p = 0.02
set_seed(CFG.seed)
# ─────────────────────────────────────────────────────────────────────────────
# Template
# ─────────────────────────────────────────────────────────────────────────────
class AIMO3Template:
def __init__(self):
pass
def get_system_content(
self,
system_prompt: str,
tool_config: ToolNamespaceConfig
) -> SystemContent:
return (
SystemContent.new()
.with_model_identity(system_prompt)
.with_reasoning_effort(reasoning_effort=ReasoningEffort.HIGH)
.with_tools(tool_config)
)
def apply_chat_template(
self,
system_prompt: str,
user_prompt: str,
tool_config: ToolNamespaceConfig
) -> list[Message]:
system_content = self.get_system_content(system_prompt, tool_config)
system_message = Message.from_role_and_content(Role.SYSTEM, system_content)
user_message = Message.from_role_and_content(Role.USER, user_prompt)
return [system_message, user_message]
# ─────────────────────────────────────────────────────────────────────────────
# Sandbox
# ─────────────────────────────────────────────────────────────────────────────
class AIMO3Sandbox:
_port_lock = threading.Lock()
_next_port = 50000
@classmethod
def _get_next_ports(cls, count: int = 5) -> list[int]:
with cls._port_lock:
ports = list(range(cls._next_port, cls._next_port + count))
cls._next_port += count
return ports
def __init__(self, timeout: float):
self._default_timeout = timeout
self._owns_kernel = False
self._client = None
self._km = None
ports = self._get_next_ports(5)
env = os.environ.copy()
env['PYDEVD_DISABLE_FILE_VALIDATION'] = '1'
env['PYDEVD_WARN_EVALUATION_TIMEOUT'] = '0'
env['JUPYTER_PLATFORM_DIRS'] = '1'
env['PYTHONWARNINGS'] = 'ignore'
env['MPLBACKEND'] = 'Agg'
self._km = KernelManager()
self._km.shell_port = ports[0]
self._km.iopub_port = ports[1]
self._km.stdin_port = ports[2]
self._km.hb_port = ports[3]
self._km.control_port = ports[4]
self._km.start_kernel(
env=env,
extra_arguments=['--Application.log_level=CRITICAL']
)
self._client = self._km.blocking_client()
self._client.start_channels()
self._client.wait_for_ready(timeout=self._default_timeout)
self._owns_kernel = True
self.execute(
'import math\n'
'import numpy\n'
'import sympy\n'
'import itertools\n'
'import collections\n'
'import mpmath\n'
'mpmath.mp.dps = 64\n'
)
def _format_error(self, traceback: list[str]) -> str:
clean_lines = []
for frame in traceback:
clean_frame = re.sub(r'\x1b\[[0-9;]*m', '', frame)
if 'File "' in clean_frame and 'ipython-input' not in clean_frame:
continue
clean_lines.append(clean_frame)
return ''.join(clean_lines)
def execute(self, code: str, timeout: float | None = None) -> str:
client = self._client
effective_timeout = timeout or self._default_timeout
msg_id = client.execute(
code,
store_history=True,
allow_stdin=False,
stop_on_error=False
)
stdout_parts = []
stderr_parts = []
start_time = time.time()
while True:
if time.time() - start_time > effective_timeout:
self._km.interrupt_kernel()
return f'[ERROR] Execution timed out after {effective_timeout} seconds'
try:
msg = client.get_iopub_msg(timeout=1.0)
except queue.Empty:
continue
if msg.get('parent_header', {}).get('msg_id') != msg_id:
continue
msg_type = msg.get('msg_type')
content = msg.get('content', {})
if msg_type == 'stream':
text = content.get('text', '')
if content.get('name') == 'stdout':
stdout_parts.append(text)
else:
stderr_parts.append(text)
elif msg_type == 'error':
stderr_parts.append(
self._format_error(content.get('traceback', []))
)
elif msg_type in {'execute_result', 'display_data'}:
data = content.get('data', {})
text = data.get('text/plain')
if text:
stdout_parts.append(
text if text.endswith('\n') else f'{text}\n'
)
elif msg_type == 'status':
if content.get('execution_state') == 'idle':
break
stdout = ''.join(stdout_parts)
stderr = ''.join(stderr_parts)
if stderr:
return f'{stdout.rstrip()}\n{stderr}' if stdout else stderr
return stdout if stdout.strip() else '[WARN] No output. Use print() to see results.'
def close(self):
with contextlib.suppress(Exception):
if self._client:
self._client.stop_channels()
if self._owns_kernel and self._km is not None:
with contextlib.suppress(Exception):
self._km.shutdown_kernel(now=True)
with contextlib.suppress(Exception):
self._km.cleanup_resources()
def reset(self):
self.execute(
'%reset -f\n'
'import math\n'
'import numpy\n'
'import sympy\n'
'import itertools\n'
'import collections\n'
'import mpmath\n'
'mpmath.mp.dps = 64\n'
)
def __del__(self):
self.close()
# ─────────────────────────────────────────────────────────────────────────────
# Tool
# ─────────────────────────────────────────────────────────────────────────────
class AIMO3Tool:
def __init__(
self,
local_jupyter_timeout: float,
tool_prompt: str,
sandbox=None
):
self._local_jupyter_timeout = local_jupyter_timeout
self._tool_prompt = tool_prompt
self._jupyter_session = sandbox
self._owns_session = sandbox is None
self._execution_lock = threading.Lock()
self._init_lock = threading.Lock()
def _ensure_session(self):
if self._jupyter_session is None:
with self._init_lock:
if self._jupyter_session is None:
self._jupyter_session = AIMO3Sandbox(
timeout=self._local_jupyter_timeout
)
def _ensure_last_print(self, code: str) -> str:
lines = code.strip().split('\n')
if not lines:
return code
last_line = lines[-1].strip()
if 'print' in last_line or 'import' in last_line:
return code
if not last_line or last_line.startswith('#'):
return code
lines[-1] = 'print(' + last_line + ')'
return '\n'.join(lines)
@property
def instruction(self) -> str:
return self._tool_prompt
@property
def tool_config(self) -> ToolNamespaceConfig:
return ToolNamespaceConfig(
name='python',
description=self.instruction,
tools=[]
)
def _make_response(self, output: str, channel: str | None = None) -> Message:
content = TextContent(text=output)
author = Author(role=Role.TOOL, name='python')
message = Message(
author=author, content=[content]
).with_recipient('assistant')
if channel:
message = message.with_channel(channel)
return message
def process_sync_plus(self, message: Message) -> list[Message]:
self._ensure_session()
raw_script = message.content[0].text
final_script = self._ensure_last_print(raw_script)
with self._execution_lock:
try:
output = self._jupyter_session.execute(final_script)
except TimeoutError as exc:
output = f'[ERROR] {exc}'
return [self._make_response(output, channel=message.channel)]
# ─────────────────────────────────────────────────────────────────────────────
# Solver
# ─────────────────────────────────────────────────────────────────────────────
class AIMO3Solver:
def __init__(self, cfg, port: int = 8000):
self.cfg = cfg
self.port = port
self.base_url = f'http://0.0.0.0:{port}/v1'
self.api_key = 'sk-local'
self.template = AIMO3Template()
self.encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
self.stop_token_ids = self.encoding.stop_tokens_for_assistant_actions()
self._preload_model_weights()
self.server_process = self._start_server()
self.client = OpenAI(
base_url=self.base_url,
api_key=self.api_key,
timeout=self.cfg.session_timeout
)
self._wait_for_server()
self._initialize_kernels()
self.notebook_start_time = time.time()
self.problems_remaining = 50
# ── Model loading & server ────────────────────────────────────────
def _preload_model_weights(self) -> None:
print(f'Loading model weights from {self.cfg.model_path} into OS Page Cache...')
start_time = time.time()
files_to_load = []
total_size = 0
for root, _, files in os.walk(self.cfg.model_path):
for file_name in files:
file_path = os.path.join(root, file_name)
if os.path.isfile(file_path):
files_to_load.append(file_path)
total_size += os.path.getsize(file_path)
def _read_file(path: str) -> None:
with open(path, 'rb') as f:
while f.read(1024 * 1024 * 1024):
pass
with ThreadPoolExecutor(max_workers=self.cfg.workers) as executor:
list(executor.map(_read_file, files_to_load))
elapsed = time.time() - start_time
print(
f'Processed {len(files_to_load)} files '
f'({total_size / 1e9:.2f} GB) in {elapsed:.2f} seconds.\n'
)
def _start_server(self) -> subprocess.Popen:
cmd = [
sys.executable, '-m', 'vllm.entrypoints.openai.api_server',
'--seed', str(self.cfg.seed),
'--model', self.cfg.model_path,
'--served-model-name', self.cfg.served_model_name,
'--tensor-parallel-size', '1',
'--max-num-seqs', str(self.cfg.batch_size),
'--gpu-memory-utilization', str(self.cfg.gpu_memory_utilization),
'--host', '0.0.0.0',
'--port', str(self.port),
'--dtype', self.cfg.dtype,
'--kv-cache-dtype', self.cfg.kv_cache_dtype,
'--max-model-len', str(self.cfg.context_tokens),
'--stream-interval', str(self.cfg.stream_interval),
'--async-scheduling',
'--disable-log-stats',
'--enable-prefix-caching',
]
self.log_file = open('vllm_server.log', 'w')
return subprocess.Popen(
cmd,
stdout=self.log_file,
stderr=subprocess.STDOUT,
start_new_session=True
)
def _wait_for_server(self):
print('Waiting for vLLM server...')
start_time = time.time()
for _ in range(self.cfg.server_timeout):
return_code = self.server_process.poll()
if return_code is not None:
self.log_file.flush()
with open('vllm_server.log', 'r') as lf:
logs = lf.read()
raise RuntimeError(
f'Server died with code {return_code}. Full logs:\n{logs}\n'
)
try:
self.client.models.list()
elapsed = time.time() - start_time
print(f'Server is ready (took {elapsed:.2f} seconds).\n')
return
except Exception:
time.sleep(1)
raise RuntimeError('Server failed to start (timeout).\n')
def _initialize_kernels(self) -> None:
print(f'Initializing {self.cfg.workers} persistent Jupyter kernels...')
start_time = time.time()
self.sandbox_pool = queue.Queue()
def _create_sandbox():
return AIMO3Sandbox(timeout=self.cfg.jupyter_timeout)
with ThreadPoolExecutor(max_workers=self.cfg.workers) as executor:
futures = [
executor.submit(_create_sandbox)
for _ in range(self.cfg.workers)
]
for future in as_completed(futures):
self.sandbox_pool.put(future.result())
elapsed = time.time() - start_time
print(f'Kernels initialized in {elapsed:.2f} seconds.\n')
# ── Utilities ─────────────────────────────────────────────────────
def _scan_for_answer(self, text: str) -> int | None:
pattern = r'\\boxed\s*\{\s*([0-9,]+)\s*\}'
matches = re.findall(pattern, text)
if matches:
try:
value = int(matches[-1].replace(',', ''))
if 0 <= value <= 99999:
return value
except ValueError:
pass
pattern = r'final\s+answer\s+is\s*([0-9,]+)'
matches = re.findall(pattern, text, re.IGNORECASE)
if matches:
try:
value = int(matches[-1].replace(',', ''))
if 0 <= value <= 99999:
return value
except ValueError:
pass
return None
def _compute_mean_entropy(self, logprobs_buffer: list) -> float:
if not logprobs_buffer:
return float('inf')
total_entropy = 0.0
token_count = 0
for top_logprobs_dict in logprobs_buffer:
if not isinstance(top_logprobs_dict, dict) or not top_logprobs_dict:
continue
token_entropy = 0.0
for token_str, log_prob in top_logprobs_dict.items():
prob = math.exp(log_prob)
if prob > 0:
token_entropy -= prob * math.log2(prob)
total_entropy += token_entropy
token_count += 1
if token_count == 0:
return float('inf')
return total_entropy / token_count
def _compute_lead(self, valid_answers: list) -> tuple[int, int, int]:
"""
Returns (top_count, second_count, lead).
second_count = 0 when only one distinct answer exists.
lead is always an integer, never NaN or None.
"""
if not valid_answers:
return 0, 0, 0
counts = Counter(valid_answers).most_common()
top_count = counts[0][1]
second_count = counts[1][1] if len(counts) > 1 else 0
return top_count, second_count, top_count - second_count
def _has_converged(
self,
valid_answers: list,
is_first_batch: bool,
) -> bool:
"""
Single convergence check used in both:
- Early stop within a batch
- While loop exit after a batch completes
Three conditions must ALL be satisfied:
1. total_valid >= min_valid_answers (enough data to trust lead)
2. top_count >= early_stop (absolute vote strength)
3. lead >= min_lead_* (gap: stricter in batch 1)
min_valid_answers replaces both the old min_valid_for_lead AND
min_attempts_before_early_stop — they were the same concept.
"""
top_count, second_count, lead = self._compute_lead(valid_answers)
total_valid = len(valid_answers)
# Condition 1: enough valid answers to trust any signal
if total_valid < self.cfg.min_valid_answers:
return False
# Condition 2: top answer needs absolute vote strength
if top_count < self.cfg.early_stop:
return False
# Condition 3: lead threshold — stricter in first batch
min_lead = (
self.cfg.min_lead_first_batch
if is_first_batch
else self.cfg.min_lead_later_batches
)
if lead < min_lead:
return False
return True
def _pick_prompt(self, position: int) -> str:
"""
Returns the system prompt for a given position within a batch.
verify takes priority over alt.
Position 0,1,2 → standard (convergent, rigorous)
Position 3,4 → verify (brute-force enumeration required)
Position 5,6,7,8 → alt (different approach required)
"""
if position in self.cfg.verify_prompt_attempts:
return self.cfg.system_prompt_verify
elif position in self.cfg.alt_prompt_attempts:
return self.cfg.system_prompt_alt
else:
return self.cfg.system_prompt
# ── Single attempt ────────────────────────────────────────────────
def _process_attempt(
self,
problem: str,
system_prompt: str,
attempt_index: int,
stop_event: threading.Event,
deadline: float,
temperature_override: float | None = None,
) -> dict:
if stop_event.is_set() or time.time() > deadline:
return {
'Attempt': attempt_index,
'Prompt': 'skipped',
'Temperature': temperature_override,
'Answer': None,
'Entropy': float('inf'),
'Response Length': 0,
'Python Calls': 0,
'Python Errors': 0,
}
local_tool = None
sandbox = None
python_calls = 0
python_errors = 0
total_tokens = 0
final_answer = None
logprobs_buffer = []
attempt_seed = int(math.pow(self.cfg.seed + attempt_index, 2))
attempt_temperature = (
temperature_override
if temperature_override is not None
else self.cfg.attempt_temperatures[0]
)
# Identify prompt label for display
if system_prompt == self.cfg.system_prompt_verify:
prompt_label = 'verify'
elif system_prompt == self.cfg.system_prompt_alt:
prompt_label = 'alt'
else:
prompt_label = 'standard'
try:
sandbox = self.sandbox_pool.get(timeout=self.cfg.sandbox_timeout)
local_tool = AIMO3Tool(
local_jupyter_timeout=self.cfg.jupyter_timeout,
tool_prompt=self.cfg.tool_prompt,
sandbox=sandbox
)
encoding = self.encoding
messages = self.template.apply_chat_template(
system_prompt,
problem,
local_tool.tool_config
)
conversation = Conversation.from_messages(messages)
for _ in range(self.cfg.turns):
if stop_event.is_set() or time.time() > deadline:
break
prompt_ids = encoding.render_conversation_for_completion(
conversation, Role.ASSISTANT
)
max_tokens = self.cfg.context_tokens - len(prompt_ids)
if max_tokens < self.cfg.buffer_tokens:
break
stream = self.client.completions.create(
model=self.cfg.served_model_name,
temperature=attempt_temperature,
logprobs=self.cfg.top_logprobs,
max_tokens=max_tokens,
prompt=prompt_ids,
seed=attempt_seed,
stream=True,
extra_body={
'min_p': self.cfg.min_p,
'stop_token_ids': self.stop_token_ids,
'return_token_ids': True
}
)
try:
token_buffer = []
text_chunks = []
for chunk in stream:
if stop_event.is_set() or time.time() > deadline:
break
new_tokens = chunk.choices[0].token_ids
new_text = chunk.choices[0].text
if new_tokens:
token_buffer.extend(new_tokens)
total_tokens += len(new_tokens)
text_chunks.append(new_text)
chunk_logprobs = chunk.choices[0].logprobs
if chunk_logprobs is not None:
if chunk_logprobs.top_logprobs:
logprobs_buffer.extend(
chunk_logprobs.top_logprobs
)
if '}' in new_text:
search_text = ''.join(
text_chunks[-self.cfg.search_tokens:]
)
answer = self._scan_for_answer(search_text)
if answer is not None:
final_answer = answer
break
finally:
stream.close()
if final_answer is not None:
break
if not token_buffer:
break
new_messages = encoding.parse_messages_from_completion_tokens(
token_buffer, Role.ASSISTANT
)
conversation.messages.extend(new_messages)
last_message = new_messages[-1]
if last_message.channel == 'final':
final_answer = self._scan_for_answer(
last_message.content[0].text
)
break
if last_message.recipient == 'python':
python_calls += 1
tool_responses = local_tool.process_sync_plus(last_message)
response_text = tool_responses[0].content[0].text
if (
response_text.startswith('[ERROR]')
or 'Traceback' in response_text
or 'Error:' in response_text
):
python_errors += 1
conversation.messages.extend(tool_responses)
except Exception:
python_errors += 1
finally:
if sandbox is not None:
sandbox.reset()
self.sandbox_pool.put(sandbox)
return {
'Attempt': attempt_index,
'Prompt': prompt_label,
'Temperature': attempt_temperature,
'Answer': final_answer,
'Entropy': self._compute_mean_entropy(logprobs_buffer),
'Response Length': total_tokens,
'Python Calls': python_calls,
'Python Errors': python_errors,
}
# ── Run one parallel batch of 9 attempts ──────────────────────────
def _run_batch(
self,
user_input: str,
batch_number: int,
global_attempt_offset: int,
stop_event: threading.Event,
deadline: float,
detailed_results: list,
valid_answers: list,
is_first_batch: bool,
) -> None:
"""
Runs CFG.attempts attempts in parallel.
Appends to shared detailed_results and valid_answers in place.
Calls _has_converged after each result to decide early stop.
"""
print(
f'\n── Batch {batch_number} '
f'(attempts {global_attempt_offset}–'
f'{global_attempt_offset + self.cfg.attempts - 1}) ──'
)
executor = ThreadPoolExecutor(max_workers=self.cfg.workers)
batch_futures = []
try:
for position in range(self.cfg.attempts):
attempt_index = global_attempt_offset + position
temp = self.cfg.attempt_temperatures[
position % len(self.cfg.attempt_temperatures)
]
prompt = self._pick_prompt(position)
future = executor.submit(
self._process_attempt,
user_input,
prompt,
attempt_index,
stop_event,
deadline,
temp,
)
batch_futures.append(future)
for future in as_completed(batch_futures):
if stop_event.is_set():
break
try:
result = future.result()
detailed_results.append(result)
if result['Answer'] is not None:
valid_answers.append(result['Answer'])
if self._has_converged(valid_answers, is_first_batch):
top_count, second_count, lead = self._compute_lead(valid_answers)
print(
f' Early stop — '
f'total_valid={len(valid_answers)}, '
f'top={top_count}, second={second_count}, lead={lead}'
)
stop_event.set()
break
except Exception as exc:
print(f' Attempt failed: {exc}')
finally:
for f in batch_futures:
f.cancel()
executor.shutdown(wait=True, cancel_futures=True)
# ── Answer selection ──────────────────────────────────────────────
def _select_answer(self, detailed_results: list) -> int:
answer_weights = defaultdict(float)
answer_votes = defaultdict(int)
for result in detailed_results:
answer = result['Answer']
entropy = result['Entropy']
if answer is not None:
weight = 1.0 / max(entropy, 1e-9)
answer_weights[answer] += weight
answer_votes[answer] += 1
scored_answers = sorted(
[
{'answer': ans, 'votes': answer_votes[ans], 'score': score}
for ans, score in answer_weights.items()
],
key=lambda x: (x['votes'], x['score']),
reverse=True
)
vote_df = pd.DataFrame(
[
(s['answer'], s['votes'], round(s['score'], 3))
for s in scored_answers
],
columns=['Answer', 'Votes', 'Score']
)
display(vote_df)
if not scored_answers:
print('\nFinal Answer: 0\n')
return 0
final_answer = scored_answers[0]['answer']
print(f'\nFinal Answer: {final_answer}\n')
return final_answer
# ── Display combined results table ────────────────────────────────
def _display_results(self, detailed_results: list) -> None:
if not detailed_results:
return
results_df = pd.DataFrame(detailed_results)
results_df['Entropy'] = results_df['Entropy'].round(3)
results_df['Answer'] = results_df['Answer'].astype('Int64')
results_df['Temperature'] = results_df['Temperature'].astype(float).round(1)
cols = [
'Attempt', 'Prompt', 'Temperature', 'Answer', 'Entropy',
'Response Length', 'Python Calls', 'Python Errors'
]
cols = [c for c in cols if c in results_df.columns]
display(results_df[cols])
# ── Main solve loop ───────────────────────────────────────────────
def solve_problem(self, problem: str) -> int:
print(f'\nProblem: {problem}\n')
user_input = f'{problem} {self.cfg.preference_prompt}'
# ── Base budget (same formula as original) ────────────────────
elapsed_global = time.time() - self.notebook_start_time
time_left = self.cfg.notebook_limit - elapsed_global
problems_left_others = max(0, self.problems_remaining - 1)
reserved_time = problems_left_others * self.cfg.base_problem_timeout
base_budget = time_left - reserved_time
base_budget = min(base_budget, self.cfg.high_problem_timeout)
base_budget = max(base_budget, self.cfg.base_problem_timeout)
# ── Extended budget: 4x base, capped at 4x high_problem_timeout
extended_budget = min(
base_budget * self.cfg.budget_multiplier,
self.cfg.extended_timeout_cap
)
deadline = time.time() + extended_budget
print(
f'Base budget: {base_budget:.0f}s | '
f'Extended budget: {extended_budget:.0f}s | '
f'Deadline: {deadline:.2f}\n'
)
# ── Accumulate across all batches ─────────────────────────────
detailed_results = []
valid_answers = []
global_attempt_offset = 0
batch_number = 1
while True:
time_remaining = deadline - time.time()
if time_remaining < 30:
print(
f'Only {time_remaining:.0f}s left — '
f'not enough time for batch {batch_number}, stopping.'
)
break
is_first_batch = (batch_number == 1)
stop_event = threading.Event()
self._run_batch(
user_input=user_input,
batch_number=batch_number,
global_attempt_offset=global_attempt_offset,
stop_event=stop_event,
deadline=deadline,
detailed_results=detailed_results,
valid_answers=valid_answers,
is_first_batch=is_first_batch,
)
global_attempt_offset += self.cfg.attempts
batch_number += 1
top_count, second_count, lead = self._compute_lead(valid_answers)
print(
f'\nAfter batch {batch_number - 1}: '
f'distribution={Counter(valid_answers).most_common(5)} | '
f'top={top_count} | second={second_count} | lead={lead} | '
f'total_valid={len(valid_answers)}'
)
# While loop exit uses the later-batch threshold
# (is_first_batch=False) since we check after the batch completes
if self._has_converged(valid_answers, is_first_batch=False):
print(f'Converged after batch {batch_number - 1}. Stopping.')
break
time_remaining = deadline - time.time()
print(
f'Not yet converged — '
f'running batch {batch_number} '
f'({time_remaining:.0f}s remaining)'
)
# ── Decrement once, after all batches ─────────────────────────
self.problems_remaining = max(0, self.problems_remaining - 1)
# ── Display combined table and select answer ──────────────────
total_batches = batch_number - 1
print(f'\n{"─" * 60}')
print(
f'Total: {len(detailed_results)} attempts '
f'across {total_batches} batch(es)'
)
print(f'{"─" * 60}')
self._display_results(detailed_results)
if not valid_answers:
print('\nResult: 0\n')
return 0
return self._select_answer(detailed_results)
def __del__(self):
if hasattr(self, 'server_process'):
self.server_process.terminate()
self.server_process.wait()
if hasattr(self, 'log_file'):
self.log_file.close()
if hasattr(self, 'sandbox_pool'):
while not self.sandbox_pool.empty():
try:
sb = self.sandbox_pool.get_nowait()
sb.close()
except Exception:
pass
In [10]:
solver = AIMO3Solver(CFG)
Loading model weights from /kaggle/input/gpt-oss-120b/transformers/default/1 into OS Page Cache... Processed 26 files (65.28 GB) in 72.94 seconds. Waiting for vLLM server... Server is ready (took 126.08 seconds). Initializing 16 persistent Jupyter kernels... Kernels initialized in 2.89 seconds.
In [11]:
def predict(id_: pl.DataFrame, question: pl.DataFrame, answer: Optional[pl.DataFrame] = None) -> pl.DataFrame:
global correct_count, total_count, predictions
id_value = id_.item(0)
question_text = question.item(0)
gc.disable()
print("------")
print(f"ID: {id_value}")
print(f"Question: {question_text[:200]}...")
final_answer = solver.solve_problem(question_text)
predictions[id_value] = final_answer
gc.enable()
gc.collect()
total_count += 1
if id_value in ground_truth:
gt = ground_truth[id_value]
is_correct = (final_answer == gt)
if is_correct:
correct_count += 1
status = "✅" if is_correct else "❌"
print(f"Answer: {final_answer} | Ground Truth: {gt} | {status}")
print(f"📊 Running Accuracy: {correct_count}/{total_count} ({100*correct_count/total_count:.1f}%)")
else:
print(f"Answer: {final_answer}")
print("------\n")
return pl.DataFrame({'id': id_value, 'answer': final_answer})
# Load reference data and keep ground truth for accuracy calculation
df = pd.read_csv(
"/kaggle/input/ai-mathematical-olympiad-progress-prize-3/reference.csv"
)
# Store ground truth answers for accuracy calculation (only in local mode)
ground_truth = dict(zip(df["id"], df["answer"])) if "answer" in df.columns else {}
# Drop id and answer columns
df = df.drop(columns=["answer"], errors="ignore")
# Create input file without answers
df.to_csv("ref.csv", index=False)
# Track predictions for accuracy calculation
predictions = {}
correct_count = 0
total_count = 0
# df = df[df['id'].isin(['dd7f5e', '86e8e5'])]
# df = df[df['id'].isin(['86e8e5'])]
# df = pd.concat([df]*10, ignore_index=True)
df.to_csv("ref.csv", index=False)
In [12]:
print(df)
id problem
0 0e644e Let $ABC$ be an acute-angled triangle with int...
1 26de63 Define a function $f \colon \mathbb{Z}_{\geq 1...
2 424e18 A tournament is held with $2^{20}$ runners eac...
3 42d360 On a blackboard, Ken starts off by writing a p...
4 641659 Let $ABC$ be a triangle with $AB \neq AC$, cir...
5 86e8e5 Let $n \geq 6$ be a positive integer. We call ...
6 92ba6a Alice and Bob are each holding some integer nu...
7 9c1c5f Let $f \colon \mathbb{Z}_{\geq 1} \to \mathbb{...
8 a295e9 A $500 \times 500$ square is divided into $k$ ...
9 dd7f5e Let $\mathcal{F}$ be the set of functions $\al...
In [13]:
inference_server = kaggle_evaluation.aimo_3_inference_server.AIMO3InferenceServer(predict)
if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
inference_server.serve()
else:
inference_server.run_local_gateway(
# ('/kaggle/input/ai-mathematical-olympiad-progress-prize-3/test.csv',)
('/kaggle/working/ref.csv',)
)
------
ID: 9c1c5f
Question: Let $f \colon \mathbb{Z}_{\geq 1} \to \mathbb{Z}_{\geq 1}$ be a function such that for all positive integers $m$ and $n$,
\begin{equation*}
f(m) + f(n) = f(m + n + mn).
\end{equation*}
Across all...
Problem: Let $f \colon \mathbb{Z}_{\geq 1} \to \mathbb{Z}_{\geq 1}$ be a function such that for all positive integers $m$ and $n$,
\begin{equation*}
f(m) + f(n) = f(m + n + mn).
\end{equation*}
Across all functions $f$ such that $f(n) \leq 1000$ for all $n \leq 1000$, how many different values can $f(2024)$ take?
Base budget: 900s | Extended budget: 2700s | Deadline: 1774625340.03
── Batch 1 (attempts 0–8) ──
Early stop — total_valid=6, top=6, second=0, lead=6
After batch 1: distribution=[(580, 6)] | top=6 | second=0 | lead=6 | total_valid=6
Converged after batch 1. Stopping.
────────────────────────────────────────────────────────────
Total: 6 attempts across 1 batch(es)
────────────────────────────────────────────────────────────
| Attempt | Prompt | Temperature | Answer | Entropy | Response Length | Python Calls | Python Errors | |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | verify | 0.2 | 580 | 0.594 | 7046 | 5 | 1 |
| 1 | 5 | verify | 0.6 | 580 | 0.587 | 7056 | 6 | 2 |
| 2 | 4 | verify | 0.6 | 580 | 0.738 | 8201 | 11 | 1 |
| 3 | 2 | standard | 0.2 | 580 | 0.652 | 9057 | 7 | 1 |
| 4 | 8 | alt | 0.6 | 580 | 0.743 | 10728 | 12 | 0 |
| 5 | 6 | alt | 0.6 | 580 | 0.739 | 11047 | 11 | 1 |
| Answer | Votes | Score | |
|---|---|---|---|
| 0 | 580 | 6 | 8.972 |
Final Answer: 580
Answer: 580 | Ground Truth: 580 | ✅
📊 Running Accuracy: 1/1 (100.0%)
------
------
ID: 86e8e5
Question: Let $n \geq 6$ be a positive integer. We call a positive integer $n$-Norwegian if it has three distinct positive divisors whose sum is equal to $n$. Let $f(n)$ denote the smallest $n$-Norwegian positi...
Problem: Let $n \geq 6$ be a positive integer. We call a positive integer $n$-Norwegian if it has three distinct positive divisors whose sum is equal to $n$. Let $f(n)$ denote the smallest $n$-Norwegian positive integer. Let $M=3^{2025!}$ and for a non-negative integer $c$ define
\begin{equation*}
g(c)=\frac{1}{2025!}\left\lfloor \frac{2025! f(M+c)}{M}\right\rfloor.
\end{equation*}
We can write
\begin{equation*}
g(0)+g(4M)+g(1848374)+g(10162574)+g(265710644)+g(44636594)=\frac{p}{q}
\end{equation*}
where $p$ and $q$ are coprime positive integers. What is the remainder when $p+q$ is divided by $99991$?
Base budget: 900s | Extended budget: 2700s | Deadline: 1774625488.49
── Batch 1 (attempts 0–8) ──
After batch 1: distribution=[(51379, 1), (23, 1), (40614, 1), (85452, 1), (87208, 1)] | top=1 | second=1 | lead=0 | total_valid=7
Not yet converged — running batch 2 (2081s remaining)
── Batch 2 (attempts 9–17) ──
After batch 2: distribution=[(54680, 2), (96985, 2), (8687, 2), (51379, 1), (23, 1)] | top=2 | second=2 | lead=0 | total_valid=15
Not yet converged — running batch 3 (1499s remaining)
── Batch 3 (attempts 18–26) ──
After batch 3: distribution=[(23, 2), (54680, 2), (96985, 2), (8687, 2), (6825, 2)] | top=2 | second=2 | lead=0 | total_valid=21
Not yet converged — running batch 4 (737s remaining)
── Batch 4 (attempts 27–35) ──
After batch 4: distribution=[(8687, 4), (98449, 3), (23, 2), (54680, 2), (96985, 2)] | top=4 | second=3 | lead=1 | total_valid=27
Not yet converged — running batch 5 (-0s remaining)
Only -0s left — not enough time for batch 5, stopping.
────────────────────────────────────────────────────────────
Total: 36 attempts across 4 batch(es)
────────────────────────────────────────────────────────────
| Attempt | Prompt | Temperature | Answer | Entropy | Response Length | Python Calls | Python Errors | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | standard | 0.2 | 51379 | 0.620 | 19516 | 19 | 3 |
| 1 | 8 | alt | 0.6 | 23 | 0.686 | 30998 | 21 | 1 |
| 2 | 6 | alt | 0.6 | 40614 | 0.625 | 28399 | 37 | 8 |
| 3 | 3 | verify | 0.2 | 85452 | 0.570 | 28575 | 59 | 7 |
| 4 | 7 | alt | 0.6 | <NA> | 0.648 | 32779 | 35 | 3 |
| 5 | 1 | standard | 0.2 | 87208 | 0.636 | 34867 | 32 | 1 |
| 6 | 5 | verify | 0.6 | <NA> | 0.643 | 48952 | 63 | 4 |
| 7 | 2 | standard | 0.2 | 54680 | 0.537 | 45404 | 86 | 7 |
| 8 | 4 | verify | 0.6 | 1035 | 0.615 | 54424 | 47 | 2 |
| 9 | 16 | alt | 0.6 | 96985 | 0.660 | 23482 | 16 | 0 |
| 10 | 15 | alt | 0.6 | 13657 | 0.570 | 26359 | 39 | 4 |
| 11 | 9 | standard | 0.2 | 77463 | 0.636 | 33965 | 48 | 2 |
| 12 | 11 | standard | 0.2 | 8687 | 0.612 | 33023 | 34 | 5 |
| 13 | 13 | verify | 0.6 | 8687 | 0.631 | 34817 | 51 | 2 |
| 14 | 14 | verify | 0.6 | <NA> | 0.641 | 39611 | 30 | 3 |
| 15 | 12 | verify | 0.2 | 54680 | 0.589 | 37003 | 39 | 7 |
| 16 | 10 | standard | 0.2 | 96985 | 0.587 | 41373 | 56 | 4 |
| 17 | 17 | alt | 0.6 | 1233 | 0.582 | 45585 | 48 | 4 |
| 18 | 23 | verify | 0.6 | 6825 | 0.647 | 28463 | 37 | 2 |
| 19 | 22 | verify | 0.6 | 23 | 0.696 | 32131 | 16 | 1 |
| 20 | 26 | alt | 0.6 | 6825 | 0.587 | 33489 | 45 | 1 |
| 21 | 25 | alt | 0.6 | 92200 | 0.602 | 46552 | 62 | 5 |
| 22 | 20 | standard | 0.2 | 40958 | 0.617 | 39302 | 77 | 25 |
| 23 | 21 | verify | 0.2 | 64836 | 0.578 | 47921 | 88 | 8 |
| 24 | 19 | standard | 0.2 | <NA> | 0.560 | 48494 | 53 | 3 |
| 25 | 24 | alt | 0.6 | <NA> | 0.638 | 50776 | 115 | 2 |
| 26 | 18 | standard | 0.2 | <NA> | 0.565 | 59170 | 47 | 11 |
| 27 | 33 | alt | 0.6 | 98449 | 0.654 | 40707 | 42 | 4 |
| 28 | 27 | standard | 0.2 | 98449 | 0.605 | 46205 | 52 | 3 |
| 29 | 32 | verify | 0.6 | 98449 | 0.592 | 42522 | 68 | 15 |
| 30 | 34 | alt | 0.6 | 8687 | 0.625 | 46821 | 80 | 6 |
| 31 | 29 | standard | 0.2 | 8687 | 0.583 | 46392 | 61 | 7 |
| 32 | 28 | standard | 0.2 | 82321 | 0.562 | 48037 | 72 | 7 |
| 33 | 31 | verify | 0.6 | <NA> | 0.630 | 52173 | 38 | 3 |
| 34 | 35 | alt | 0.6 | <NA> | 0.599 | 52635 | 46 | 4 |
| 35 | 30 | verify | 0.2 | <NA> | 0.602 | 47787 | 78 | 4 |
| Answer | Votes | Score | |
|---|---|---|---|
| 0 | 8687 | 4 | 6.534 |
| 1 | 98449 | 3 | 4.872 |
| 2 | 54680 | 2 | 3.559 |
| 3 | 6825 | 2 | 3.250 |
| 4 | 96985 | 2 | 3.219 |
| 5 | 23 | 2 | 2.895 |
| 6 | 82321 | 1 | 1.780 |
| 7 | 85452 | 1 | 1.756 |
| 8 | 13657 | 1 | 1.754 |
| 9 | 64836 | 1 | 1.730 |
| 10 | 1233 | 1 | 1.718 |
| 11 | 92200 | 1 | 1.661 |
| 12 | 1035 | 1 | 1.626 |
| 13 | 40958 | 1 | 1.622 |
| 14 | 51379 | 1 | 1.614 |
| 15 | 40614 | 1 | 1.599 |
| 16 | 77463 | 1 | 1.573 |
| 17 | 87208 | 1 | 1.572 |
Final Answer: 8687
Answer: 8687 | Ground Truth: 8687 | ✅
📊 Running Accuracy: 2/2 (100.0%)
------
------
ID: 42d360
Question: On a blackboard, Ken starts off by writing a positive integer $n$ and then applies the following move until he first reaches $1$. Given that the number on the board is $m$, he chooses a base $b$, wher...
Problem: On a blackboard, Ken starts off by writing a positive integer $n$ and then applies the following move until he first reaches $1$. Given that the number on the board is $m$, he chooses a base $b$, where $2 \leq b \leq m$, and considers the unique base-$b$ representation of $m$,
\begin{equation*}
m = \sum_{k = 0}^\infty a_k \cdot b^k
\end{equation*}
where $a_k$ are non-negative integers and $0 \leq a_k < b$ for each $k$. Ken then erases $m$ on the blackboard and replaces it with $\sum\limits_{k = 0}^\infty a_k$.
Across all choices of $1 \leq n \leq 10^{10^5}$, the largest possible number of moves Ken could make is $M$. What is the remainder when $M$ is divided by $10^{5}$?
Base budget: 450s | Extended budget: 1349s | Deadline: 1774626838.48
── Batch 1 (attempts 0–8) ──
Early stop — total_valid=6, top=6, second=0, lead=6
After batch 1: distribution=[(32193, 6)] | top=6 | second=0 | lead=6 | total_valid=6
Converged after batch 1. Stopping.
────────────────────────────────────────────────────────────
Total: 6 attempts across 1 batch(es)
────────────────────────────────────────────────────────────
| Attempt | Prompt | Temperature | Answer | Entropy | Response Length | Python Calls | Python Errors | |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | verify | 0.6 | 32193 | 0.763 | 5379 | 4 | 0 |
| 1 | 3 | verify | 0.2 | 32193 | 0.616 | 6170 | 8 | 1 |
| 2 | 1 | standard | 0.2 | 32193 | 0.667 | 6225 | 5 | 0 |
| 3 | 6 | alt | 0.6 | 32193 | 0.661 | 7260 | 14 | 0 |
| 4 | 0 | standard | 0.2 | 32193 | 0.592 | 8507 | 9 | 0 |
| 5 | 8 | alt | 0.6 | 32193 | 0.665 | 9557 | 16 | 3 |
| Answer | Votes | Score | |
|---|---|---|---|
| 0 | 32193 | 6 | 9.14 |
Final Answer: 32193
Answer: 32193 | Ground Truth: 32193 | ✅
📊 Running Accuracy: 3/3 (100.0%)
------
------
ID: 26de63
Question: Define a function $f \colon \mathbb{Z}_{\geq 1} \to \mathbb{Z}_{\geq 1}$ by
\begin{equation*}
f(n) = \sum_{i = 1}^n \sum_{j = 1}^n j^{1024} \left\lfloor\frac1j + \frac{n-i}{n}\right\rfloor.
\end{e...
Problem: Define a function $f \colon \mathbb{Z}_{\geq 1} \to \mathbb{Z}_{\geq 1}$ by
\begin{equation*}
f(n) = \sum_{i = 1}^n \sum_{j = 1}^n j^{1024} \left\lfloor\frac1j + \frac{n-i}{n}\right\rfloor.
\end{equation*}
Let $M=2 \cdot 3 \cdot 5 \cdot 7 \cdot 11 \cdot 13$ and let $N = f{\left(M^{15}\right)} - f{\left(M^{15}-1\right)}$. Let $k$ be the largest non-negative integer such that $2^k$ divides $N$. What is the remainder when $2^k$ is divided by $5^7$?
Base budget: 639s | Extended budget: 1916s | Deadline: 1774627516.52
── Batch 1 (attempts 0–8) ──
Early stop — total_valid=6, top=6, second=0, lead=6
After batch 1: distribution=[(32951, 6)] | top=6 | second=0 | lead=6 | total_valid=6
Converged after batch 1. Stopping.
────────────────────────────────────────────────────────────
Total: 6 attempts across 1 batch(es)
────────────────────────────────────────────────────────────
| Attempt | Prompt | Temperature | Answer | Entropy | Response Length | Python Calls | Python Errors | |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | verify | 0.2 | 32951 | 0.511 | 4741 | 2 | 1 |
| 1 | 0 | standard | 0.2 | 32951 | 0.552 | 5145 | 5 | 0 |
| 2 | 1 | standard | 0.2 | 32951 | 0.548 | 5695 | 7 | 2 |
| 3 | 2 | standard | 0.2 | 32951 | 0.481 | 7076 | 4 | 1 |
| 4 | 5 | verify | 0.6 | 32951 | 0.523 | 8004 | 6 | 0 |
| 5 | 6 | alt | 0.6 | 32951 | 0.519 | 7822 | 14 | 2 |
| Answer | Votes | Score | |
|---|---|---|---|
| 0 | 32951 | 6 | 11.512 |
Final Answer: 32951
Answer: 32951 | Ground Truth: 32951 | ✅
📊 Running Accuracy: 4/4 (100.0%)
------
------
ID: dd7f5e
Question: Let $\mathcal{F}$ be the set of functions $\alpha \colon \mathbb{Z}\to \mathbb{Z}$ for which there are only finitely many $n \in \mathbb{Z}$ such that $\alpha(n) \neq 0$.
For two functions $\alpha$ ...
Problem: Let $\mathcal{F}$ be the set of functions $\alpha \colon \mathbb{Z}\to \mathbb{Z}$ for which there are only finitely many $n \in \mathbb{Z}$ such that $\alpha(n) \neq 0$.
For two functions $\alpha$ and $\beta$ in $\mathcal{F}$, define their product $\alpha\star\beta$ to be $\sum\limits_{n\in\mathbb{Z}} \alpha(n)\cdot \beta(n)$. Also, for $n\in\mathbb{Z}$, define a shift operator $S_n \colon \mathcal{F}\to \mathcal{F}$ by $S_n(\alpha)(t)=\alpha(t+n)$ for all $t \in \mathbb{Z}$.
A function $\alpha \in \mathcal{F}$ is called \emph{shifty} if
\begin{itemize}
\item $\alpha(m)=0$ for all integers $m<0$ and $m>8$ and
\item There exists $\beta \in \mathcal{F}$ and integers $k \neq l$ such that for all $n \in \mathbb{Z}$
\begin{equation*}
S_n(\alpha)\star\beta =
\begin{cases}
1 & n \in \{k,l\} \\
0 & n \not \in \{k,l\}
\end{cases}
\; .
\end{equation*}
\end{itemize}
How many shifty functions are there in $\mathcal{F}$?
Base budget: 841s | Extended budget: 2522s | Deadline: 1774628220.51
── Batch 1 (attempts 0–8) ──
Early stop — total_valid=8, top=5, second=2, lead=3
After batch 1: distribution=[(160, 5), (266, 2), (44, 1)] | top=5 | second=2 | lead=3 | total_valid=8
Converged after batch 1. Stopping.
────────────────────────────────────────────────────────────
Total: 8 attempts across 1 batch(es)
────────────────────────────────────────────────────────────
| Attempt | Prompt | Temperature | Answer | Entropy | Response Length | Python Calls | Python Errors | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | standard | 0.2 | 160 | 0.738 | 13524 | 15 | 1 |
| 1 | 7 | alt | 0.6 | 160 | 0.769 | 15564 | 8 | 1 |
| 2 | 5 | verify | 0.6 | 44 | 0.684 | 18182 | 28 | 2 |
| 3 | 8 | alt | 0.6 | 160 | 0.678 | 18946 | 16 | 1 |
| 4 | 1 | standard | 0.2 | 266 | 0.672 | 19320 | 22 | 5 |
| 5 | 3 | verify | 0.2 | 266 | 0.573 | 18345 | 18 | 7 |
| 6 | 6 | alt | 0.6 | 160 | 0.679 | 20395 | 30 | 6 |
| 7 | 0 | standard | 0.2 | 160 | 0.658 | 22441 | 33 | 1 |
| Answer | Votes | Score | |
|---|---|---|---|
| 0 | 160 | 5 | 7.122 |
| 1 | 266 | 2 | 3.231 |
| 2 | 44 | 1 | 1.461 |
Final Answer: 160 Answer: 160 | Ground Truth: 160 | ✅ 📊 Running Accuracy: 5/5 (100.0%) ------ ------ ID: 92ba6a Question: Alice and Bob are each holding some integer number of sweets. Alice says to Bob: ``If we each added the number of sweets we're holding to our (positive integer) age, my answer would be double yours. I... Problem: Alice and Bob are each holding some integer number of sweets. Alice says to Bob: ``If we each added the number of sweets we're holding to our (positive integer) age, my answer would be double yours. If we took the product, then my answer would be four times yours.'' Bob replies: ``Why don't you give me five of your sweets because then both our sum and product would be equal.'' What is the product of Alice and Bob's ages? Base budget: 864s | Extended budget: 2591s | Deadline: 1774628566.92 ── Batch 1 (attempts 0–8) ── Early stop — total_valid=6, top=6, second=0, lead=6 After batch 1: distribution=[(50, 6)] | top=6 | second=0 | lead=6 | total_valid=6 Converged after batch 1. Stopping. ──────────────────────────────────────────────────────────── Total: 6 attempts across 1 batch(es) ────────────────────────────────────────────────────────────
| Attempt | Prompt | Temperature | Answer | Entropy | Response Length | Python Calls | Python Errors | |
|---|---|---|---|---|---|---|---|---|
| 0 | 8 | alt | 0.6 | 50 | 0.495 | 1772 | 2 | 0 |
| 1 | 2 | standard | 0.2 | 50 | 0.554 | 1835 | 1 | 0 |
| 2 | 0 | standard | 0.2 | 50 | 0.444 | 2646 | 1 | 0 |
| 3 | 5 | verify | 0.6 | 50 | 0.406 | 2642 | 2 | 0 |
| 4 | 7 | alt | 0.6 | 50 | 0.655 | 2910 | 1 | 0 |
| 5 | 4 | verify | 0.6 | 50 | 0.521 | 3092 | 2 | 0 |
| Answer | Votes | Score | |
|---|---|---|---|
| 0 | 50 | 6 | 11.987 |
Final Answer: 50
Answer: 50 | Ground Truth: 50 | ✅
📊 Running Accuracy: 6/6 (100.0%)
------
------
ID: 424e18
Question: A tournament is held with $2^{20}$ runners each of which has a different running speed. In each race, two runners compete against each other with the faster runner always winning the race. The competi...
Problem: A tournament is held with $2^{20}$ runners each of which has a different running speed. In each race, two runners compete against each other with the faster runner always winning the race. The competition consists of $20$ rounds with each runner starting with a score of $0$. In each round, the runners are paired in such a way that in each pair, both runners have the same score at the beginning of the round. The winner of each race in the $i^{\text{th}}$ round receives $2^{20-i}$ points and the loser gets no points.
At the end of the tournament, we rank the competitors according to their scores. Let $N$ denote the number of possible orderings of the competitors at the end of the tournament. Let $k$ be the largest positive integer such that $10^k$ divides $N$. What is the remainder when $k$ is divided by $10^{5}$?
Base budget: 900s | Extended budget: 2700s | Deadline: 1774628714.22
── Batch 1 (attempts 0–8) ──
After batch 1: distribution=[(21818, 5), (62140, 4)] | top=5 | second=4 | lead=1 | total_valid=9
Not yet converged — running batch 2 (2415s remaining)
── Batch 2 (attempts 9–17) ──
Early stop — total_valid=10, top=6, second=4, lead=2
After batch 2: distribution=[(21818, 6), (62140, 4)] | top=6 | second=4 | lead=2 | total_valid=10
Converged after batch 2. Stopping.
────────────────────────────────────────────────────────────
Total: 10 attempts across 2 batch(es)
────────────────────────────────────────────────────────────
| Attempt | Prompt | Temperature | Answer | Entropy | Response Length | Python Calls | Python Errors | |
|---|---|---|---|---|---|---|---|---|
| 0 | 5 | verify | 0.6 | 62140 | 0.842 | 7942 | 3 | 0 |
| 1 | 8 | alt | 0.6 | 62140 | 0.901 | 8546 | 3 | 0 |
| 2 | 4 | verify | 0.6 | 21818 | 0.768 | 10854 | 7 | 0 |
| 3 | 1 | standard | 0.2 | 21818 | 0.830 | 12850 | 5 | 0 |
| 4 | 0 | standard | 0.2 | 21818 | 0.700 | 13887 | 18 | 1 |
| 5 | 2 | standard | 0.2 | 62140 | 0.843 | 16277 | 4 | 1 |
| 6 | 7 | alt | 0.6 | 21818 | 0.786 | 16722 | 11 | 0 |
| 7 | 3 | verify | 0.2 | 62140 | 0.732 | 23759 | 25 | 0 |
| 8 | 6 | alt | 0.6 | 21818 | 0.686 | 30257 | 27 | 2 |
| 9 | 11 | standard | 0.2 | 21818 | 0.716 | 11066 | 9 | 1 |
| Answer | Votes | Score | |
|---|---|---|---|
| 0 | 21818 | 6 | 8.063 |
| 1 | 62140 | 4 | 4.849 |
Final Answer: 21818
Answer: 21818 | Ground Truth: 21818 | ✅
📊 Running Accuracy: 7/7 (100.0%)
------
------
ID: 0e644e
Question: Let $ABC$ be an acute-angled triangle with integer side lengths and $AB<AC$. Points $D$ and $E$ lie on segments $BC$ and $AC$, respectively, such that $AD=AE=AB$. Line $DE$ intersects $AB$ at $X$. Cir...
Problem: Let $ABC$ be an acute-angled triangle with integer side lengths and $AB<AC$. Points $D$ and $E$ lie on segments $BC$ and $AC$, respectively, such that $AD=AE=AB$. Line $DE$ intersects $AB$ at $X$. Circles $BXD$ and $CED$ intersect for the second time at $Y \neq D$. Suppose that $Y$ lies on line $AD$. There is a unique such triangle with minimal perimeter. This triangle has side lengths $a=BC$, $b=CA$, and $c=AB$. Find the remainder when $abc$ is divided by $10^{5}$.
Base budget: 900s | Extended budget: 2700s | Deadline: 1774629153.98
── Batch 1 (attempts 0–8) ──
Early stop — total_valid=6, top=6, second=0, lead=6
After batch 1: distribution=[(336, 6)] | top=6 | second=0 | lead=6 | total_valid=6
Converged after batch 1. Stopping.
────────────────────────────────────────────────────────────
Total: 6 attempts across 1 batch(es)
────────────────────────────────────────────────────────────
| Attempt | Prompt | Temperature | Answer | Entropy | Response Length | Python Calls | Python Errors | |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | alt | 0.6 | 336 | 0.569 | 11819 | 18 | 2 |
| 1 | 2 | standard | 0.2 | 336 | 0.606 | 13297 | 12 | 1 |
| 2 | 7 | alt | 0.6 | 336 | 0.513 | 12986 | 23 | 2 |
| 3 | 8 | alt | 0.6 | 336 | 0.516 | 14895 | 17 | 1 |
| 4 | 4 | verify | 0.6 | 336 | 0.551 | 15454 | 15 | 3 |
| 5 | 5 | verify | 0.6 | 336 | 0.503 | 21184 | 25 | 6 |
| Answer | Votes | Score | |
|---|---|---|---|
| 0 | 336 | 6 | 11.098 |
Final Answer: 336
Answer: 336 | Ground Truth: 336 | ✅
📊 Running Accuracy: 8/8 (100.0%)
------
------
ID: a295e9
Question: A $500 \times 500$ square is divided into $k$ rectangles, each having integer side lengths. Given that no two of these rectangles have the same perimeter, the largest possible value of $k$ is $\mathca...
Problem: A $500 \times 500$ square is divided into $k$ rectangles, each having integer side lengths. Given that no two of these rectangles have the same perimeter, the largest possible value of $k$ is $\mathcal{K}$. What is the remainder when $k$ is divided by $10^{5}$?
Base budget: 900s | Extended budget: 2700s | Deadline: 1774629407.88
── Batch 1 (attempts 0–8) ──
Early stop — total_valid=6, top=5, second=1, lead=4
After batch 1: distribution=[(520, 5), (186, 1)] | top=5 | second=1 | lead=4 | total_valid=6
Converged after batch 1. Stopping.
────────────────────────────────────────────────────────────
Total: 6 attempts across 1 batch(es)
────────────────────────────────────────────────────────────
| Attempt | Prompt | Temperature | Answer | Entropy | Response Length | Python Calls | Python Errors | |
|---|---|---|---|---|---|---|---|---|
| 0 | 8 | alt | 0.6 | 520 | 0.912 | 10020 | 6 | 0 |
| 1 | 2 | standard | 0.2 | 520 | 0.901 | 12290 | 5 | 0 |
| 2 | 3 | verify | 0.2 | 520 | 0.820 | 15301 | 8 | 0 |
| 3 | 0 | standard | 0.2 | 186 | 0.779 | 17588 | 3 | 0 |
| 4 | 4 | verify | 0.6 | 520 | 0.882 | 20225 | 15 | 0 |
| 5 | 1 | standard | 0.2 | 520 | 0.859 | 23439 | 9 | 0 |
| Answer | Votes | Score | |
|---|---|---|---|
| 0 | 520 | 5 | 5.724 |
| 1 | 186 | 1 | 1.283 |
Final Answer: 520
Answer: 520 | Ground Truth: 520 | ✅
📊 Running Accuracy: 9/9 (100.0%)
------
------
ID: 641659
Question: Let $ABC$ be a triangle with $AB \neq AC$, circumcircle $\Omega$, and incircle $\omega$. Let the contact points of $\omega$ with $BC$, $CA$, and $AB$ be $D$, $E$, and $F$, respectively. Let the circum...
Problem: Let $ABC$ be a triangle with $AB \neq AC$, circumcircle $\Omega$, and incircle $\omega$. Let the contact points of $\omega$ with $BC$, $CA$, and $AB$ be $D$, $E$, and $F$, respectively. Let the circumcircle of $AFE$ meet $\Omega$ at $K$ and let the reflection of $K$ in $EF$ be $K'$. Let $N$ denote the foot of the perpendicular from $D$ to $EF$. The circle tangent to line $BN$ and passing through $B$ and $K$ intersects $BC$ again at $T \neq B$.
Let sequence $(F_n)_{n \geq 0}$ be defined by $F_0 = 0$, $F_1 = 1$ and for $n \geq 2$, $F_n = F_{n-1} + F_{n-2}$. Call $ABC$ $n$\emph{-tastic} if $BD = F_n$, $CD = F_{n+1}$, and $KNK'B$ is cyclic. Across all $n$-tastic triangles, let $a_n$ denote the maximum possible value of $\frac{CT \cdot NB}{BT \cdot NE}$. Let $\alpha$ denote the smallest real number such that for all sufficiently large $n$, $a_{2n} < \alpha$. Given that $\alpha = p + \sqrt{q}$ for rationals $p$ and $q$, what is the remainder when $\left\lfloor p^{q^p} \right\rfloor$ is divided by $99991$?
Base budget: 900s | Extended budget: 2700s | Deadline: 1774629674.61
── Batch 1 (attempts 0–8) ──
Early stop — total_valid=6, top=5, second=1, lead=4
After batch 1: distribution=[(57447, 5), (1, 1)] | top=5 | second=1 | lead=4 | total_valid=6
Converged after batch 1. Stopping.
────────────────────────────────────────────────────────────
Total: 6 attempts across 1 batch(es)
────────────────────────────────────────────────────────────
| Attempt | Prompt | Temperature | Answer | Entropy | Response Length | Python Calls | Python Errors | |
|---|---|---|---|---|---|---|---|---|
| 0 | 5 | verify | 0.6 | 57447 | 0.486 | 13115 | 6 | 1 |
| 1 | 3 | verify | 0.2 | 57447 | 0.325 | 17483 | 32 | 9 |
| 2 | 0 | standard | 0.2 | 57447 | 0.501 | 22950 | 45 | 5 |
| 3 | 4 | verify | 0.6 | 57447 | 0.412 | 24048 | 32 | 5 |
| 4 | 7 | alt | 0.6 | 57447 | 0.479 | 23111 | 17 | 6 |
| 5 | 2 | standard | 0.2 | 1 | 0.537 | 25198 | 54 | 6 |
| Answer | Votes | Score | |
|---|---|---|---|
| 0 | 57447 | 5 | 11.644 |
| 1 | 1 | 1 | 1.862 |
Final Answer: 57447 Answer: 57447 | Ground Truth: 57447 | ✅ 📊 Running Accuracy: 10/10 (100.0%) ------