HTSP

H-TSP is an end-to-end deep reinforcement learning approach that can handle TSP problems with up to 10,000 nodes. H-TSP uses a hierarchical reinforcement learning approach, which means it breaks down the big problem into smaller parts and solves them in two steps.

HTSPInitialization

Bases: Initialization

Methods:

  • In training mode, the HTSP policy is executed to collect experiences(explore_trajectory, update_buffer) and use PPO to update actor and crtic network.

  • In evaluation mode, the HTSP policy sample a point through upper level network, and use this point and its related points to construct a path through lower level solver, which is realized by HTSP_test_step function.

Model

The upper level network for HTSP, includes IMPALAEnocder, ActorPPO, CriticPPO.

Network type function
IMPALAEncoder CNN output a feature map
ActorPPO MLP ouput the coordinate of center point
CriticPPO MLP ouput the value of state
class IMPALAEncoder(nn.ModuleDict):
    def __init__(
        self,
        input_dim,
        embedding_dim,
        bev_range=None,
        bev_pixel_size=None,
        depths=None,
    **kwargs):
class ActorPPO(nn.Module):
    def __init__(
        self, 
        state_dim, 
        mid_dim, 
        action_dim, 
        init_a_std_log=-0.5, 
        **kwargs):
```python

```python
class CriticPPO(nn.Module):
    def __init__(
        self, 
        state_dim, 
        mid_dim, 
        _action_dim,
        **kwargs):

Policy

  • HTSP_policy

    • Backbone

      • encoder: the TSP graph is converted into a pseudo-image by discretizing the 2D space into a grid.The CNN encoder’s function is to process a “pseudo-image” that represents the Traveling Salesman Problem (TSP) instance.

      • actor: Its purpose is to select a small subset of important nodes (up to 200) from the overall TSP problem.

      • critic: The primary function of the critic is to estimate the “value” of a given state.

  • main Parameters of HTSP

    • fragment_length: The length of the sub-fragment.

    • target_steps: The maximum steps in a trajectory.

    • k: The number of neighbors in knn.

  • Lower_slover

    • POMO: It employs the POMO strategypomo to generate solutions for sub-fragments. In particular, Transformer has also been applied in the solver.

    • LKH: A exact solver LKH is used to solve the sub-fragment problem.

Usage

You can run the following command lines to execute the code.

python train.py settings=htsp_settings mode=train model=htsp scale={scale} batch_size={batch_size} episodes={episodes} settings.model.low_env.auto_reset=True
python eval.py settings=htsp_settings mode=test model=htsp scale={scale} batch_size={batch_size} episodes={episodes} setttings.model.low_env.auto_reset=False test_data_path={PATH} 
problem scale k fragment_length target_steps
tsp 1000 40 200 64
2000 40 200 64
5000 40 200 64
10000 40 200 64

Tasks

Supported Tasks: TSP.

Required Data Generator:

Training

DeepACO applies PPO learning class PPOLightning() for training.