Description
Trained ONNX models that use visual observations perform significantly worse when run in Unity than during Python inference, where they work correctly. The agent exhibits noisy behavior and makes incorrect decisions that do not occur during training or Python inference. When the same data is provided as a Vector observation instead of a Visual one, both training and Unity inference work correctly.
To Reproduce
Steps to reproduce the behavior:
1. Create a custom visual sensor (50x50 pixels, 3 channels) using ObservationSpec.Visual().
2. Train with mlagents-learn config.yaml --run-id=test_run.
3. Test with Python inference via mlagents-learn config.yaml --run-id=test_run --resume --inference; the agent behaves correctly.
4. Load the generated .onnx file in Unity using the standard ML-Agents package; agent behavior becomes noisy and decisions are incorrect.
5. Change the same sensor to use ObservationSpec.Vector() instead and retrain; both Python inference and Unity inference work correctly.
Console logs / stack traces
No error messages or stack traces appear in the Unity console. The model loads successfully but produces degraded results.
Code snippets
Key parts of the custom visual sensor implementation:

public ObservationSpec GetObservationSpec() =>
    ObservationSpec.Visual(channels, height, width, ObservationType.Default);

public int Write(ObservationWriter writer)
{
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            ref var pixel = ref data[y * width + x];
            // One channel per entity type, written as writer[channel, y, x]
            writer[0, y, x] = pixel.tower;
            writer[1, y, x] = pixel.monster;
            writer[2, y, x] = pixel.coin;
        }
    }
    return width * height * channels;
}
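If the exporter and the Unity inference backend disagree about tensor layout (channel-first CHW vs channel-last HWC), every observation would be silently scrambled rather than raising an error, which would match the symptom of a model that loads fine but acts noisily. A minimal NumPy sketch of how the same buffer diverges under the two interpretations (the 50x50x3 shape is taken from this report; the layout mismatch itself is a hypothesis, not a confirmed cause):

```python
import numpy as np

# Hypothetical check: one 50x50x3 observation buffer read as
# channel-first (CHW) vs channel-last (HWC). Element values agree,
# but the flat memory order an inference engine consumes does not.
h, w, c = 50, 50, 3
chw = np.arange(h * w * c, dtype=np.float32).reshape(c, h, w)  # [ch, y, x] order
hwc = chw.transpose(1, 2, 0)                                   # [y, x, ch] order

# Element-wise the two views agree...
assert chw[2, 5, 7] == hwc[5, 7, 2]
# ...but flattened, as a runtime would read the raw buffer, they differ.
assert not np.array_equal(chw.ravel(), hwc.ravel())
```

If Unity's inference backend expects one layout while the sensor's Write() fills the other, the network sees permuted pixels, which degrades actions without producing any console error.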
Training configuration:
behaviors:
  UniversalBot:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 40960
      learning_rate: 1.0e-3
      learning_rate_schedule: linear
      beta: 0.01
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 2
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 1
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        strength: 1.0
        gamma: 0.97
      rnd:
        strength: 0.005
        gamma: 0.99
        encoding_size: 32
        learning_rate: 1e-4
    max_steps: 4000000
    time_horizon: 128
    summary_freq: 10000
    keep_checkpoints: 1000
    checkpoint_interval: 500000
    threaded: false
Environment (please complete the following information):
Unity Version: 6000.0.43f1
OS + version: Windows 11
ML-Agents version: Both develop branch and release 22 (no difference observed)
Torch version: 2.2.2+cu121
Environment: Custom environment with simple visual sensor (50x50x3)
Additional Information:
Tested with both vis_encode_type: simple and nature_cnn - no difference
Different resolutions tested - issue persists
Training duration doesn't affect the issue
The issue only occurs when observations are marked as Visual; Vector observations work correctly
Python inference with --inference flag works perfectly, suggesting the issue is specific to Unity's ONNX execution
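One way to narrow this down further is to feed an identical, deterministic probe observation through both runtimes and diff the raw network outputs. A sketch that builds such a probe in NumPy (the gradient pattern, model path, and input name below are illustrative assumptions, not details from this report; the exported model could be run on the probe with onnxruntime, and the same pattern written from the C# sensor in Unity):

```python
import numpy as np

def make_probe(h=50, w=50):
    """Deterministic HWC test observation: channel 0 is a vertical
    gradient, channel 1 a horizontal gradient, channel 2 their product.
    The same pattern is easy to reproduce in the C# sensor's Write()."""
    ys = np.arange(h, dtype=np.float32)[:, None] / (h - 1)  # (h, 1)
    xs = np.arange(w, dtype=np.float32)[None, :] / (w - 1)  # (1, w)
    return np.stack(
        [ys + np.zeros_like(xs), np.zeros_like(ys) + xs, ys * xs],
        axis=-1,
    )

probe = make_probe()
# Hypothetical comparison step (path and input handling are assumptions):
#   import onnxruntime as ort
#   sess = ort.InferenceSession("results/test_run/UniversalBot.onnx")
#   out = sess.run(None, {sess.get_inputs()[0].name: probe[None, ...]})
# Then log the network output from Unity for the identical probe and diff.
```

Matching Python outputs against mismatched Unity outputs on the same probe would confirm the discrepancy lies in Unity's ONNX execution path rather than in the exported model.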