
Cluster training job will hang if there are too many parameter servers or ports #2224

@typhoonzero

If there are too many parameter servers, or too many parameter server ports (dense or sparse), some parameter servers will wait forever.

When such a parameter server starts up, it logs:

W0522 12:00:09.495564 35864 ParameterServer2.cpp:269] --ports_num or --ports_num_for_sparse might be too large, or total dense parameter size or sparse parameters size might be too small, this psever doesn't store any parameter.
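To illustrate how a pserver can end up storing nothing, here is a minimal sketch. It assumes parameters are cut into fixed-size blocks and dealt out round-robin to the server instances. The helpers `blocksPerServer` and `emptyServers` are hypothetical and not part of Paddle, and the real assignment logic may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Paddle: counts how many blocks each
// parameter server instance would receive if blocks were dealt out
// round-robin. The actual assignment in Paddle may differ.
std::vector<std::size_t> blocksPerServer(std::size_t totalParamSize,
                                         std::size_t blockSize,
                                         std::size_t numServerInstances) {
  assert(blockSize > 0 && numServerInstances > 0);
  // Round up so a partial trailing block still counts as one block.
  const std::size_t numBlocks = (totalParamSize + blockSize - 1) / blockSize;
  std::vector<std::size_t> counts(numServerInstances, 0);
  for (std::size_t b = 0; b < numBlocks; ++b) {
    counts[b % numServerInstances] += 1;
  }
  return counts;
}

// Number of server instances that end up with no blocks at all.
std::size_t emptyServers(const std::vector<std::size_t>& counts) {
  std::size_t empty = 0;
  for (std::size_t c : counts) {
    if (c == 0) ++empty;
  }
  return empty;
}
```

Under these assumptions, 1000 parameter values with a block size of 256 give 4 blocks, so with 8 server instances (servers times ports) 4 of them receive nothing, which is the situation the warning describes.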

In ParameterServer2.cpp:

void ParameterServer2::setParameter(const SendParameterRequest& request,
                                    std::vector<Buffer>& inputBuffers,
                                    SendParameterResponse* response,
                                    std::vector<Buffer>* outputBuffers) {
...
  if (!request.blocks().size()) {
    LOG(WARNING)
        << "--ports_num or --ports_num_for_sparse might be too large, "
        << "or total dense parameter size or sparse parameters size "
        << "might be too small, this psever doesn't store any parameter.";
    return;
  }

...


void ParameterServer2::addGradient(const SendParameterRequest& request,
                                   std::vector<Buffer>& inputBuffers,
                                   SendParameterResponse* response,
                                   std::vector<Buffer>* outputBuffers) {

if (!numPassFinishClients_) {
    REGISTER_BARRIER_DELTA_SERVER_SET(
        *statSet_,
        "forwardbackwardDelta",
        FLAGS_num_gradient_servers,
        request.trainer_id(),
        request.forwardbackward_time(),
        isSparseServer_ ? "_sparseUpdater" : "_denseUpdater");
  }

It seems that the hang is due to some other cause. I still need to work out what happens when there are more parameter blocks than pserver instances.
