Overview of AdaMEM. Instead of relying on a static, episode-level strategy, the agent adapts to the current decision step by querying a long-term trajectory memory of raw experiences (1–2) and synthesizing them into a dynamic short-term strategy memory tailored to the current state (3). Conditioned on this test-time strategy, the agent adapts its next action (4) without requiring parameter updates.
A central challenge for language agents is using past experience to adapt to dynamic test-time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long-horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test-time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long-term trajectory memory of raw experiences collected offline and generates a dynamic short-term strategy memory on the fly to guide decision-making. This mechanism exposes a trade-off between token efficiency and adaptability that can be tuned across inference-time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop, and it also leads consistently on agentic search on HotpotQA. To further enhance this adaptation, we develop STEP-MFT, a Step-wise Memory Fine-Tuning technique that trains the policy to synthesize high-quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self-evolution post-deployment in real-world environments.
Prior work such as Synapse and ReasoningBank restricts memory retrieval to episode initiation, forcing agents to follow a fixed plan throughout the task. AdaMEM overcomes this by maintaining a scalable long-term trajectory memory of successful past experiences and dynamically synthesizing a concise short-term strategy memory at each decision step. Two inference modes trade off cost and adaptability: AdaMEM-HIGH generates a fresh transient strategy whenever the agent requests memory, while AdaMEM-LOW persists a strategy and lets the agent decide when to refresh it.
Comparison of test-time agent memory mechanisms. ReAct operates without external memory. Synapse and ReasoningBank employ static initialization, retrieving a trajectory or strategy only at the episode start (S0). In contrast, AdaMEM enables test-time adaptation via dynamic memory retrieval and synthesis. AdaMEM-LOW balances efficiency by maintaining a persistent strategy and triggering a refresh only when necessary. AdaMEM-HIGH maximizes adaptability by regenerating a fresh strategy at every critical decision step.
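The two inference modes can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `retrieve` uses toy word overlap in place of a real retriever, `synthesize` stands in for an LLM call, and the callbacks `wants_memory` and `needs_refresh` are hypothetical names for the agent's own decisions to request or refresh memory.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trajectory:
    """One raw past experience stored in long-term memory (hypothetical schema)."""
    task: str
    steps: List[str]


def retrieve(memory: List[Trajectory], query: str, k: int = 2) -> List[Trajectory]:
    # Toy retriever: rank stored trajectories by word overlap with the query.
    q = set(query.lower().split())
    ranked = sorted(memory, key=lambda t: len(q & set(t.task.lower().split())), reverse=True)
    return ranked[:k]


def synthesize(state: str, experiences: List[Trajectory]) -> str:
    # Stand-in for an LLM call that condenses raw trajectories into a short strategy.
    hints = "; ".join(" -> ".join(t.steps) for t in experiences)
    return f"[{state}] past successes suggest: {hints}"


def run_episode(
    states: List[str],
    memory: List[Trajectory],
    mode: str,
    wants_memory: Callable[[str], bool],
    needs_refresh: Callable[[str, str], bool],
) -> List[Optional[str]]:
    """Return the strategy (or None) the agent is conditioned on at each step."""
    strategy: Optional[str] = None
    used: List[Optional[str]] = []
    for state in states:
        if mode == "high":
            # Transient: a fresh strategy each time memory is requested, never carried over.
            strategy = synthesize(state, retrieve(memory, state)) if wants_memory(state) else None
        elif mode == "low":
            # Persistent: keep the current strategy until the agent asks for a refresh.
            if strategy is None or needs_refresh(state, strategy):
                strategy = synthesize(state, retrieve(memory, state))
        else:
            raise ValueError(f"unknown mode: {mode}")
        used.append(strategy)
    return used
```

In this sketch the low mode calls the synthesizer only at the first step and on refresh, which is where its token savings would come from; the high mode pays for a synthesis at every step where memory is requested.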
Although AdaMEM works well zero-shot, simple prompting can yield over-general strategies. STEP-MFT trains the policy to generate high-utility strategies using a process-level signal: a strategy is kept for training only if it changed the agent's next action on a successful trajectory. This dual filter (outcome success + action change) efficiently assigns credit at the step level without requiring rollouts or a specialized critic.
STEP-MFT framework. We employ dual-filter rejection sampling to curate high-utility strategies for supervised fine-tuning. The process retains only steps from successful trajectories where the strategy actually changes the proposed action (A_t ≠ A′_t, green), and discards redundant instances where the memory-free baseline yields the same action (A_t = A′_t, blue).
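The dual filter described above can be sketched in a few lines. This is an illustrative sketch only: the record fields (`retrieved`, `action_with_strategy`, `action_without_strategy`) and the prompt/target layout of the fine-tuning examples are assumptions made for the example, not the format used by the authors.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StepRecord:
    """One decision step logged during data collection (hypothetical schema)."""
    state: str
    retrieved: List[str]           # raw experiences shown to the strategy synthesizer
    strategy: str                  # strategy generated at this step
    action_with_strategy: str      # A_t: action proposed when conditioned on the strategy
    action_without_strategy: str   # A'_t: action proposed by the memory-free baseline


@dataclass
class Episode:
    success: bool
    steps: List[StepRecord]


def dual_filter(episodes: List[Episode]) -> List[Dict[str, str]]:
    """Keep a strategy only if the episode succeeded AND the strategy changed the action."""
    kept: List[Dict[str, str]] = []
    for episode in episodes:
        if not episode.success:
            continue  # outcome filter: drop every step of a failed trajectory
        for step in episode.steps:
            if step.action_with_strategy == step.action_without_strategy:
                continue  # process filter: the strategy was redundant at this step
            prompt = f"State: {step.state}\nExperiences:\n" + "\n".join(step.retrieved)
            kept.append({"prompt": prompt, "target": step.strategy})
    return kept
```

Both checks are simple comparisons on logged data, which matches the claim that step-level credit is assigned without extra rollouts or a learned critic.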
Performance comparison under training-free setups (mean ± std over 3 runs). Bold = highest, underline = second highest.
| Memory Mechanism | Long-Term | Short-Term | ALFWorld (Seen) | ALFWorld (Unseen) | WebShop |
|---|---|---|---|---|---|
| On-Policy Long-Term Memory (Qwen3-4B-Instruct / Qwen2.5-7B-Instruct-RL) | |||||
| No Memory | — | — | 45.2 ± 1.8 | 46.8 ± 2.5 | 71.4 ± 1.4 |
| ReasoningBank | Strat. | Strat. | 49.3 ± 0.7 | 51.2 ± 0.9 | 68.6 ± 2.0 |
| Synapse | Traj. | Traj. | 52.1 ± 1.9 | 52.2 ± 0.7 | 65.4 ± 0.9 |
| AdaMEM (ours) | Traj. | Strat. | 54.0 ± 2.9 | 58.2 ± 3.9 | 74.2 ± 0.3 |
| Off-Policy Long-Term Memory (Gemma-3-27b-it) | |||||
| No Memory | — | — | 36.7 ± 0.8 | 37.6 ± 1.9 | 18.2 ± 0.1 |
| ReasoningBank | Strat. | Strat. | 45.2 ± 2.2 | 44.8 ± 1.3 | 18.6 ± 0.5 |
| Synapse | Traj. | Traj. | 35.0 ± 2.1 | 30.6 ± 0.7 | 22.6 ± 0.5 |
| AdaMEM (ours) | Traj. | Strat. | 47.4 ± 2.2 | 49.5 ± 2.6 | 24.7 ± 1.2 |
Impact of Memory Fine-Tuning. STEP-MFT (step-level filter) consistently outperforms outcome-based MFT and training-free baselines on both ALFWorld and WebShop.
Effectiveness vs. efficiency trade-off. AdaMEM establishes a superior Pareto frontier between task performance and average inference tokens per step on both ALFWorld and WebShop.
Scalability with retrieval budget k. AdaMEM's performance improves monotonically as more experiences are retrieved, while Synapse degrades because injecting additional raw trajectories leads to context overflow.
AdaMEM recovers from stale strategies mid-episode. When an initial strategy directs the agent to the wrong location, AdaMEM detects the mismatch, explicitly reasons that the "strategy is outdated," and triggers a refresh, recovering where static agents loop or hallucinate.