Meta-Optimized Risk-Aware Portfolio Management: A Hybrid Deep Reinforcement Learning and LSTM-GRU Ensemble
M. Sorouri, D. NoorMohammadzadehMaleki, A. Salehi, A. Farhadi, A. Zamanifar
Markets don't punish you on average — they punish you on the worst days. A trading agent trained to chase returns will happily take on hidden tail risk, right up until a single bad week erases a year of gains. This paper is about building an agent that respects that asymmetry: one that forecasts, acts, and is explicitly afraid of the left tail.
Three problems at once
Portfolio management is a sequential decision problem under deep uncertainty, and a practical system has to solve three things together. It must forecast shifting market dynamics, it must act on those forecasts as allocation decisions, and it must stay robust — to volatile markets and to its own brittle hyper-parameters. Optimising any one of these in isolation is what makes most RL traders fragile.
Forecasting and control, fused by a meta-controller
The framework runs forecasting and control in parallel. An LSTM-GRU ensemble predicts market dynamics, while several deep RL agents, each with a different view of the market, learn allocation policies at the same time. A meta-learning controller then coordinates them, adapting how their signals are blended to current conditions.
Figure (architecture): forecasting and control run in parallel, fused by a meta-controller.
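One way to picture the blending step is a softmax-weighted average of the agents' proposed allocations. The sketch below is a minimal illustration under that assumption, not the paper's controller: `blend_allocations`, `agent_scores`, and the softmax rule are hypothetical names and choices introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_allocations(agent_weights, agent_scores):
    """Blend per-agent portfolio weights with softmax(agent_scores).

    agent_weights: one allocation per agent, each a list summing to 1.
    agent_scores:  hypothetical controller scores, one per agent
                   (assumption: a higher score means more trust right now).
    """
    mix = softmax(agent_scores)
    n_assets = len(agent_weights[0])
    return [
        sum(mix[a] * agent_weights[a][i] for a in range(len(agent_weights)))
        for i in range(n_assets)
    ]


# Two illustrative agents proposing allocations over three assets.
agents = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
blended = blend_allocations(agents, [0.0, 0.0])  # equal trust in both agents
tilted = blend_allocations(agents, [5.0, 0.0])   # controller favours agent 0
```

Because each agent's weights sum to 1 and the softmax mix also sums to 1, the blended allocation remains a valid portfolio whatever scores the controller produces.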
Being afraid of the left tail
What makes the agents risk-aware is the reward. Instead of maximising raw return or merely penalising variance, the objective targets Conditional Value-at-Risk (CVaR), the average loss in the worst fraction of outcomes.
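For reference, the standard textbook definition can be written as follows, with $L$ a loss and $\alpha$ the tail probability. The symbols are assumptions introduced here and the paper's notation may differ; the conditional-expectation form of CVaR holds when $L$ has a continuous distribution.

```latex
\mathrm{VaR}_{\alpha}(L) = \inf\{\ell \in \mathbb{R} : P(L \le \ell) \ge 1 - \alpha\},
\qquad
\mathrm{CVaR}_{\alpha}(L) = \mathbb{E}\left[\, L \mid L \ge \mathrm{VaR}_{\alpha}(L) \,\right]
```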
Where Value-at-Risk (VaR) only marks the threshold of a bad day, CVaR measures how bad the bad days actually are.
Figure (conceptual illustration of the objective, not measured returns): a return distribution with an adjustable cutoff. VaR is the threshold; CVaR is the average loss in the shaded worst-case tail. In the example setting shown, a tail probability of 23% gives a VaR of -5.3% and a CVaR of -7.9%. Optimising CVaR makes the agent care about how bad the bad days are, not just their frequency.
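The difference between the two quantities can be made concrete with a small empirical estimate. This is a minimal sketch, assuming a simple sort-and-slice estimator over a list of returns; `var_cvar` and the sample data are hypothetical and not taken from the study.

```python
import math


def var_cvar(returns, tail_prob):
    """Empirical VaR and CVaR of a list of returns.

    VaR is the best return inside the worst `tail_prob` fraction (the cutoff);
    CVaR is the mean return over that same worst fraction.
    """
    if not 0.0 < tail_prob <= 1.0:
        raise ValueError("tail_prob must be in (0, 1]")
    ordered = sorted(returns)                          # worst returns first
    k = max(1, math.ceil(tail_prob * len(ordered)))    # number of tail samples
    tail = ordered[:k]
    return tail[-1], sum(tail) / k


# Illustrative daily returns (made-up numbers).
sample = [-0.10, -0.06, -0.02, 0.00, 0.01, 0.01, 0.02, 0.03, 0.03, 0.04]
var, cvar = var_cvar(sample, tail_prob=0.2)
print(f"VaR={var:.3f}  CVaR={cvar:.3f}")
```

With this estimator CVaR can never be better than VaR, since it averages the cutoff with everything worse than it.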
By minimising CVaR, the agents are steered away from allocations that look attractive on average but hide catastrophic tails — exactly the failure mode that sinks return-greedy strategies.
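A toy comparison shows why a tail-aware score separates strategies that an average cannot. The sketch assumes a score of the form mean return minus a weighted CVaR loss; `risk_adjusted_score`, `risk_weight`, and the two return series are hypothetical, and the paper's actual reward may be shaped differently.

```python
import math


def tail_mean(returns, tail_prob):
    """Mean of the worst `tail_prob` fraction of returns."""
    ordered = sorted(returns)
    k = max(1, math.ceil(tail_prob * len(ordered)))
    return sum(ordered[:k]) / k


def risk_adjusted_score(returns, tail_prob=0.2, risk_weight=1.0):
    """Mean return minus risk_weight times the CVaR loss (assumed form)."""
    mean_return = sum(returns) / len(returns)
    cvar_loss = -tail_mean(returns, tail_prob)  # positive when the tail loses money
    return mean_return - risk_weight * cvar_loss


# Two made-up strategies with the same average return but different tails.
steady = [0.01] * 10
spiky = [0.03] * 9 + [-0.17]

steady_score = risk_adjusted_score(steady)
spiky_score = risk_adjusted_score(spiky)
```

Both series average roughly 1% per period, yet the spiky one scores lower because its single large loss dominates the tail term.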
Robust by construction
The last piece is meta-optimization: rather than hand-tuning the many knobs that govern training, the controller tunes the pipeline itself, and the whole system is validated with Bayesian backtesting so results aren't an artifact of one lucky configuration or split.
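The idea of not trusting one lucky split can be illustrated with a plain walk-forward evaluation. This is a simplified stand-in, not the Bayesian procedure the study uses: the toy strategy, the `lookback` setting, and the worst-split selection rule are all assumptions made for this sketch.

```python
def walk_forward_splits(n_obs, n_splits, test_size):
    """Yield (train_end, test_end) index pairs for expanding-window splits."""
    first_train_end = n_obs - n_splits * test_size
    if first_train_end <= 0:
        raise ValueError("not enough observations for the requested splits")
    for i in range(n_splits):
        train_end = first_train_end + i * test_size
        yield train_end, train_end + test_size


def evaluate(config, returns, n_splits=3, test_size=4):
    """Return one score per split for a toy strategy.

    Toy rule: stay invested over the test window only if the mean of the
    last config["lookback"] training returns is positive.
    """
    scores = []
    for train_end, test_end in walk_forward_splits(len(returns), n_splits, test_size):
        start = max(0, train_end - config["lookback"])
        window = returns[start:train_end]
        invested = sum(window) / len(window) > 0
        scores.append(sum(returns[train_end:test_end]) if invested else 0.0)
    return scores


def pick_config(configs, returns):
    """Prefer the config whose worst split is best, not the best single split."""
    return max(configs, key=lambda c: min(evaluate(c, returns)))


# Made-up return series and two candidate settings.
returns = [0.01, -0.02, 0.015, 0.0, 0.02, -0.01, 0.005, 0.01, -0.03, 0.02,
           0.01, 0.0, 0.015, -0.005, 0.01, 0.02, -0.01, 0.0, 0.01, 0.005]
configs = [{"lookback": 2}, {"lookback": 6}]
best = pick_config(configs, returns)
```

Ranking by the worst split rather than the average is one conservative choice among several; it penalises settings that only look good on a single window.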
Why it matters
Bringing ensemble forecasting, multi-agent risk-sensitive RL, and meta-optimization under one roof points toward portfolio managers that are both adaptive and resilient — chasing growth while keeping an explicit grip on the downside.
The visuals on this page are conceptual: they illustrate the architecture and the CVaR objective. The published study reports the quantitative backtesting results.
Cite this work
@inproceedings{sorouri2025metaoptimized,
  title     = {Meta-Optimized Risk-Aware Portfolio Management: A Hybrid Deep Reinforcement Learning and LSTM-GRU Ensemble},
  author    = {M. Sorouri and D. NoorMohammadzadehMaleki and A. Salehi and A. Farhadi and A. Zamanifar},
  booktitle = {2025 10th South-East Europe Design Automation, Computer Engineering Conference (SEEDA-CECNSM)},
  year      = {2025},
  doi       = {10.1109/SEEDA-CECNSM68644.2025.11329752}
}