HTSP¶

H-TSP is an end-to-end deep reinforcement learning approach that can handle TSP problems with up to 10,000 nodes. H-TSP uses a hierarchical reinforcement learning approach, which means it breaks down the big problem into smaller parts and solves them in two steps.

`HTSPInitialization`¶

Bases: Initialization

Methods:

In training mode, the HTSP policy is executed to collect experiences(explore_trajectory, update_buffer) and use PPO to update actor and crtic network.
In evaluation mode, the HTSP policy sample a point through upper level network, and use this point and its related points to construct a path through lower level solver, which is realized by HTSP_test_step function.

Model¶

The upper level network for HTSP, includes IMPALAEnocder, ActorPPO, CriticPPO.

Network	type	function
IMPALAEncoder	CNN	output a feature map
ActorPPO	MLP	ouput the coordinate of center point
CriticPPO	MLP	ouput the value of state

class IMPALAEncoder(nn.ModuleDict):
    def __init__(
        self,
        input_dim,
        embedding_dim,
        bev_range=None,
        bev_pixel_size=None,
        depths=None,
    **kwargs):

class ActorPPO(nn.Module):
    def __init__(
        self, 
        state_dim, 
        mid_dim, 
        action_dim, 
        init_a_std_log=-0.5, 
        **kwargs):
```python

```python
class CriticPPO(nn.Module):
    def __init__(
        self, 
        state_dim, 
        mid_dim, 
        _action_dim,
        **kwargs):

Policy¶

HTSP_policy
- Backbone
  - encoder: the TSP graph is converted into a pseudo-image by discretizing the 2D space into a grid.The CNN encoder’s function is to process a “pseudo-image” that represents the Traveling Salesman Problem (TSP) instance.
  - actor: Its purpose is to select a small subset of important nodes (up to 200) from the overall TSP problem.
  - critic: The primary function of the critic is to estimate the “value” of a given state.
main Parameters of HTSP
- fragment_length: The length of the sub-fragment.
- target_steps: The maximum steps in a trajectory.
- k: The number of neighbors in knn.
Lower_slover
- POMO: It employs the POMO strategypomo to generate solutions for sub-fragments. In particular, Transformer has also been applied in the solver.
- LKH: A exact solver LKH is used to solve the sub-fragment problem.

Usage¶

You can run the following command lines to execute the code.

python train.py settings=htsp_settings mode=train model=htsp scale={scale} batch_size={batch_size} episodes={episodes} settings.model.low_env.auto_reset=True
python eval.py settings=htsp_settings mode=test model=htsp scale={scale} batch_size={batch_size} episodes={episodes} setttings.model.low_env.auto_reset=False test_data_path={PATH} 

problem	scale	k	fragment_length	target_steps
tsp	1000	40	200	64
	2000	40	200	64
	5000	40	200	64
	10000	40	200	64

Tasks¶

Supported Tasks: TSP.

Required Data Generator:

TSPGenerator

Training¶

DeepACO applies PPO learning class PPOLightning() for training.