Reinforcement Learning PyTorch OpenAI Gymnasium Actor-Critic

Teaching a Rocket to Land

Reinforcement Learning — Prog 02
LunarLander-v3
PyTorch
1,000 episodes
Group 9
Final Avg Reward (Actor-Critic): 187.27
Random Policy Avg Reward: −219.30
Points Better Than Random: +406
Training Episodes: 1,000
Discrete Actions: 4

The environment

LunarLander-v3 is a physics simulation where a spacecraft starts mid-air above a lunar surface. Two yellow flags mark the landing pad. The goal is to land between them, upright, without crashing. The agent chooses one of four discrete actions at each step: do nothing, fire the left engine, fire the main engine, or fire the right engine.

Every timestep, the environment returns 8 numbers describing the lander's current state — and a reward signal telling the agent how well it's doing. There are no instructions. Just experience.

Figure: The LunarLander-v3 environment, with the lander approaching the landing pad. The agent must land between the two yellow flags.
environment setup
import gymnasium as gym

env = gym.make("LunarLander-v3", render_mode="rgb_array")
state, info = env.reset()

# State space: 8 continuous values
# Box([-2.5 -2.5 -10. -10. -6.28 -10. -0. -0.],
#     [ 2.5  2.5  10.  10.  6.28  10.  1.  1.], shape=(8,))

# Action space: 4 discrete actions
# Discrete(4) — do nothing / left engine / main engine / right engine

print("State space:", env.observation_space)   # Box (8,)
print("Action space:", env.action_space)        # Discrete(4)

The 8 state values are: x position, y position, x velocity, y velocity, angle, angular velocity, and two booleans for whether the left and right legs are touching the ground. That's everything the agent gets to work with.
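
To make the layout concrete, here is a minimal sketch that gives the eight values names. The `LanderState` tuple, its field names, and the `unpack_state` helper are my own labels for illustration, not part of Gymnasium's API, and the sample observation is made up.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    """Hypothetical named view of the 8-value observation (names are illustrative)."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def unpack_state(obs: Sequence[float]) -> LanderState:
    """Map a raw 8-element observation onto named fields."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, ang_vel, left, right = (float(v) for v in obs)
    # The leg flags arrive as 0.0 / 1.0 floats, so threshold them into booleans.
    return LanderState(x, y, vx, vy, angle, ang_vel, left > 0.5, right > 0.5)


# Made-up observation: slightly left of centre, falling, both legs in the air.
example = unpack_state([-0.1, 1.4, 0.2, -0.5, 0.05, 0.0, 0.0, 0.0])
print(example.y, example.left_leg_contact)
```

The networks themselves consume the raw 8-element vector; a named view like this is only useful for logging and debugging.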

Why Actor-Critic?

You could use a simpler policy-gradient algorithm such as REINFORCE, which waits until an episode ends and then reinforces each action in proportion to the total reward that followed it. But estimates based on full-episode rewards are noisy. A single run can go well or badly by chance, and you'd update your policy based on that noise.

The Critic solves this by estimating how good each state is in real time, not just at the end of an episode. It gives the Actor a much more stable, moment-by-moment signal to learn from. Think of it like having a coach watching in real time rather than only giving feedback at the final whistle.
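
The difference between the two feedback signals can be sketched in a few lines. This is an illustration with made-up rewards and value estimates, not code from the notebook: `discounted_returns` is what an end-of-episode method would learn from, and `td_targets` is the one-step signal a Critic makes possible. Both function names are hypothetical.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Monte Carlo return G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def td_targets(rewards: List[float], values: List[float], gamma: float) -> List[float]:
    """One-step target r_t + gamma * V(s_{t+1}); the state after the last step is terminal (value 0)."""
    next_values = values[1:] + [0.0]
    return [r + gamma * v for r, v in zip(rewards, next_values)]


rewards = [1.0, 0.0, -2.0]   # made-up episode
values = [0.5, 0.2, -1.0]    # made-up Critic estimates V(s_0), V(s_1), V(s_2)
print(discounted_returns(rewards, gamma=0.5))  # [0.5, -1.0, -2.0]
print(td_targets(rewards, values, gamma=0.5))  # [1.1, -0.5, -2.0]
```

Each Monte Carlo return depends on every reward that comes later in the episode, so one unlucky step near the end changes the signal for all earlier steps. The TD target depends on a single reward plus the Critic's estimate, which is lower variance but only as good as the Critic.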

The two neural networks

Both networks share the same 8-value input. They diverge completely in what they output.

actor network — decides what to do
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()  # must run before layers are assigned on an nn.Module
        self.fc1 = nn.Linear(state_dim, 128)  # 8 inputs → 128 neurons
        self.fc2 = nn.Linear(128, 64)          # 128 → 64
        self.fc3 = nn.Linear(64, action_dim)   # 64 → 4 action probabilities

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return torch.softmax(self.fc3(x), dim=-1)  # probabilities sum to 1
critic network — judges how good the situation is
class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()  # must run before layers are assigned on an nn.Module
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 1)   # single scalar: V(s)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)  # no activation — raw value, can be negative
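
As a quick sanity check on the size of these networks, the parameter counts can be worked out by hand. The sketch below assumes each `nn.Linear` layer keeps its default bias term; the helper names are hypothetical.

```python
from typing import Sequence


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def mlp_params(layer_sizes: Sequence[int]) -> int:
    """Total parameters of a stack of fully connected layers."""
    return sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


actor_params = mlp_params([8, 128, 64, 4])    # 8 -> 128 -> 64 -> 4
critic_params = mlp_params([8, 128, 64, 1])   # 8 -> 128 -> 64 -> 1
print(actor_params, critic_params)  # 9668 9473
```

Both networks come in under ten thousand parameters, which is small enough to train on a CPU.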

What the networks output before any training

Before training starts, both networks are randomly initialised. Here's what the very first forward pass looks like — before the agent has learned anything:

forward pass — untrained networks
actor  = Actor(state_dim=8, action_dim=4)
critic = Critic(state_dim=8)

state        = torch.tensor(state, dtype=torch.float32)
action_probs = actor(state)
state_value  = critic(state)

# Output:
Action probabilities: tensor([0.2508, 0.2384, 0.2497, 0.2611])
State value:          tensor([-0.1508])

Notice how the action probabilities are almost perfectly equal — 25% each. That's a randomly initialised network with no preference at all. And the state value is −0.15, essentially saying "I have no idea how good this is." This is the starting point. Everything learned from here is through trial and error.
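
Why does an untrained Actor come out so close to uniform? Freshly initialised layers tend to produce logits near zero, and softmax turns near-equal logits into near-equal probabilities. The sketch below uses made-up logits rather than values from the notebook, and a plain-Python `softmax` in place of `torch.softmax`.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


small = softmax([0.01, -0.04, 0.00, 0.05])   # made-up logits near zero, as at initialisation
large = softmax([2.0, -1.0, 0.0, 3.0])       # made-up logits with a clear preference
print([round(p, 3) for p in small])
print([round(p, 3) for p in large])
```

The first list stays close to 25% per action, while the second concentrates most of the probability on one action. Training moves the Actor from the first regime towards the second.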

The full training loop

Every timestep follows the same five-step cycle. Here's the actual code:

training loop — full implementation
rewards_history = []

for episode in range(num_episodes):
    state, info = env.reset()
    state = torch.tensor(state, dtype=torch.float32)
    episode_rewards = 0

    for step in range(max_steps_per_episode):

        # Step 1: Actor picks an action probabilistically
        action_probs = actor(state)
        action_dist  = torch.distributions.Categorical(action_probs)
        action       = action_dist.sample()

        # Step 2: Execute in environment
        next_state, reward, done, truncated, info = env.step(action.item())
        next_state = torch.tensor(next_state, dtype=torch.float32)
        episode_rewards += reward

        # Step 3: Compute advantage — was this better or worse than expected?
        value      = critic(state)
        next_value = critic(next_state) if not done else torch.tensor(0.0)
        target     = reward + gamma * next_value.item()
        advantage  = target - value.item()  # TD error

        # Step 4: Update Actor — do more of what worked
        actor_loss = -action_dist.log_prob(action) * advantage
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()

        # Step 5: Update Critic — get better at predicting state values
        target_tensor = torch.tensor([target], dtype=torch.float32)  # shape (1,), same as value
        critic_loss   = value_loss_fn(value, target_tensor)
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        state = next_state
        if done or truncated:  # stop on termination or on the time limit
            break

    rewards_history.append(episode_rewards)
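
Steps 3 and 4 are where the learning signal is built, so here they are in isolation with plain numbers. This is a sketch with made-up values, not output from the run; `td_advantage` and `actor_loss` are hypothetical helper names that mirror the expressions in the loop above.

```python
import math


def td_advantage(reward: float, value: float, next_value: float,
                 gamma: float, done: bool) -> float:
    """One-step TD error: reward + gamma * V(s') - V(s), with V(s') = 0 at a terminal state."""
    bootstrap = 0.0 if done else next_value
    return reward + gamma * bootstrap - value


def actor_loss(action_prob: float, advantage: float) -> float:
    """Policy-gradient loss for one step: -log pi(a|s) * advantage."""
    return -math.log(action_prob) * advantage


# Made-up numbers, not taken from the training run.
good = td_advantage(reward=1.0, value=0.5, next_value=1.0, gamma=0.99, done=False)
crash = td_advantage(reward=-100.0, value=-20.0, next_value=5.0, gamma=0.99, done=True)
print(good, crash)
print(actor_loss(0.25, good), actor_loss(0.25, crash))
```

With a positive advantage, gradient descent lowers the loss by raising the log-probability of the sampled action. With a negative advantage, such as the crash, the same update pushes that probability down. The `done` flag matters: a terminal state has no future, so its bootstrap value is zero regardless of what the Critic would predict.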

Hyperparameters

Actor LR (Adam): 0.001
Critic LR (Adam): 0.005
Discount Factor γ: 0.99
Episodes: 1,000
Max Steps/Episode: 1,000

The Critic's learning rate (0.005) is 5× higher than the Actor's (0.001). This is intentional: the Actor's update is scaled by the Critic's TD error, so the Critic needs to develop a useful baseline before the Actor can meaningfully improve against it. If the Critic learns too slowly, the Actor ends up following a signal that is still mostly noise.
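
One way to keep these settings in a single place is a small config object. The `Hyperparameters` class is a hypothetical container of my own; only the values come from this write-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparameters:
    """Values quoted in this section; the class itself is an illustrative container."""
    actor_lr: float = 0.001
    critic_lr: float = 0.005
    gamma: float = 0.99
    num_episodes: int = 1000
    max_steps_per_episode: int = 1000


hp = Hyperparameters()
print(f"critic/actor LR ratio: {hp.critic_lr / hp.actor_lr:.1f}")  # critic/actor LR ratio: 5.0
```

In a PyTorch script these would typically feed the optimizers, for example `torch.optim.Adam(actor.parameters(), lr=hp.actor_lr)` and the equivalent call for the Critic with `hp.critic_lr`.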

The real training log — every 10 episodes

This is the actual output from running the notebook. You can see the full story of the agent learning, failing, discovering something, losing it again, then eventually converging:

training output — selected episodes
# Early chaos — crashing on almost every attempt
Episode   0,  Reward: -232.47
Episode  10,  Reward: -1293.77   # worst episode
Episode  20,  Reward: -407.11
Episode  50,  Reward: -576.70
Episode  90,  Reward: -1021.64

# Still struggling but stabilising around -500
Episode 100,  Reward: -716.49
Episode 200,  Reward: -497.55

# First hints of learning — approaching zero
Episode 220,  Reward: -274.53
Episode 230,  Reward: -21.30    # nearly zero for the first time
Episode 330,  Reward: +8.58     # first positive episode!

# First successful landing
Episode 360,  Reward: +265.58   # breakthrough moment

# Convergence — mostly positive, occasional relapses
Episode 430,  Reward: +241.96
Episode 440,  Reward: +299.77
Episode 450,  Reward: +284.86
Episode 530,  Reward: +284.78
Episode 640,  Reward: +259.95
Episode 770,  Reward: +249.20
Episode 850,  Reward: +250.56
Episode 890,  Reward: +276.66
Episode 990,  Reward: +225.91

The learning curve

Among the logged episodes, episode 10 was the worst, with a reward of −1,293.77. Episode 330 was the first to go positive. Episode 360 hit +265.58, the first clear successful landing. After that, rewards clustered above zero with occasional relapses, and the final greedy evaluation averaged +187.27.

Figure: Learning curve. Cumulative reward per episode over 1,000 training runs, from −1,293 to consistent +200s.
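
Per-episode rewards are noisy enough that a trend is easier to read after smoothing. Below is a minimal sketch of a trailing moving average. The `moving_average` helper is hypothetical, and the sample uses a handful of the logged rewards quoted earlier (episodes 0, 10, 20, 50 and 90, which are not consecutive) purely as illustration.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Trailing mean over up to `window` of the most recent values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


sample = [-232.47, -1293.77, -407.11, -576.70, -1021.64]
print([round(x, 2) for x in moving_average(sample, window=3)])
```

In the notebook this would be applied to the full `rewards_history` list, with a wider window such as 50 or 100 episodes (my suggestion, not a value from the project).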

Evaluation — how good is it really?

During evaluation, the Actor doesn't sample probabilistically — it just picks the action with the highest probability (greedy). This removes exploration noise and shows the true learned policy.
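
The difference between the two selection modes is easy to see on a fixed probability vector. This sketch uses plain Python in place of `torch.argmax` and `Categorical.sample`, with hypothetical helper names; the probabilities are the untrained Actor output shown earlier.

```python
import random
from collections import Counter
from typing import List


def greedy_action(probs: List[float]) -> int:
    """Index of the most probable action (ties go to the lowest index)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sampled_action(probs: List[float], rng: random.Random) -> int:
    """Draw an action index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.2508, 0.2384, 0.2497, 0.2611]
rng = random.Random(0)
print(greedy_action(probs))  # 3
print(Counter(sampled_action(probs, rng) for _ in range(1000)))
```

Greedy selection returns action 3 every time, even though it is only marginally ahead, while sampling spreads choices across all four actions. That spread is what lets the agent explore during training, and removing it is what makes evaluation repeatable for a given starting state.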

evaluation — greedy policy
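import numpy as np

def evaluate_model(actor, num_episodes=10):
    rewards = []
    for _ in range(num_episodes):
        state, info = env.reset()
        total, done, truncated = 0.0, False, False
        while not (done or truncated):
            with torch.no_grad():  # no gradients needed when only evaluating
                action_probs = actor(torch.tensor(state, dtype=torch.float32))
            # During eval: pick the BEST action, no sampling
            action = torch.argmax(action_probs).item()
            state, reward, done, truncated, info = env.step(action)
            total += reward
        rewards.append(total)
    return rewards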

evaluation_rewards = evaluate_model(actor)
print(f"Average Reward Over Evaluation: {np.mean(evaluation_rewards)}")

# Output:
Average Reward Over Evaluation: 187.27428490096685
random policy comparison
random_rewards = random_policy(env)
print(f"Average Reward for Random Policy: {np.mean(random_rewards)}")

# Output:
Average Reward for Random Policy: -219.30102964861507
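
The notebook's `random_policy` helper is not shown here. A minimal sketch of what such a baseline could look like follows, assuming the standard Gymnasium `reset`/`step` interface. The `num_episodes` parameter and its default of 10 are my assumptions, chosen to match `evaluate_model`.

```python
from typing import List


def random_policy(env, num_episodes: int = 10) -> List[float]:
    """Run episodes with uniformly random actions and return each episode's total reward."""
    totals = []
    for _ in range(num_episodes):
        env.reset()
        total, terminated, truncated = 0.0, False, False
        while not (terminated or truncated):
            action = env.action_space.sample()   # ignore the state entirely
            _, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
        totals.append(total)
    return totals
```

Because the function only relies on `reset`, `step` and `action_space.sample`, it should work with any Gymnasium environment, which makes it a convenient floor to compare a trained policy against.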
The comparison that matters

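Random policy: −219.30. Trained Actor-Critic: +187.27. That's a gap of about 406.6 points, roughly the difference between crashing on most attempts and usually sticking the landing. The average is still just under the 200 points at which Gymnasium counts an episode as solved, so there is room to improve, but the agent clearly learned to fly.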

What was actually hard

What I'd try next

Team