Tentative tests we could add:
1. The llama debug model initializes, and a forward/backward pass works.
2. Checkpoint save/load works.
3. Metrics logging (once the metrics are added).
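
As a rough illustration, the first two tests might look something like the sketch below. It assumes PyTorch and uses a hypothetical `TinyDebugModel` as a stand-in, since the real llama debug model and the project's checkpoint code are not shown here; the helper names `run_forward_backward` and `checkpoint_roundtrip` are likewise made up for this example. The metrics logging test is left out because the metrics do not exist yet.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyDebugModel(nn.Module):
    """Hypothetical stand-in for the llama debug model (not the real one)."""

    def __init__(self, vocab_size: int = 32, dim: int = 16) -> None:
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(self.embed(tokens))


def run_forward_backward(model: nn.Module, vocab_size: int = 32) -> float:
    """Run one forward/backward pass on random tokens and return the loss."""
    tokens = torch.randint(0, vocab_size, (2, 8))
    logits = model(tokens)
    # Toy objective: predict the input tokens themselves.
    loss = nn.functional.cross_entropy(
        logits.view(-1, vocab_size), tokens.view(-1)
    )
    loss.backward()
    return loss.item()


def checkpoint_roundtrip(model: nn.Module, fresh: nn.Module) -> bool:
    """Save model's state_dict, load it into fresh, and compare every tensor."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ckpt.pt")
        torch.save(model.state_dict(), path)
        fresh.load_state_dict(torch.load(path))
    return all(
        torch.equal(a, b)
        for a, b in zip(model.state_dict().values(), fresh.state_dict().values())
    )
```

In an actual test file these helpers would be wrapped in `test_...` functions that assert the loss is finite, every parameter has a gradient, and the reloaded weights match the saved ones.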