
DataServer changes from working state back to init state  #19

@Synex-wh

Describe the bug

  • Sometimes a dataServer changes from the working state back to the init state during startup. Example logs:
    image

  • Analysis of the data startup log shows that the very first data IP list obtained from meta was already the full list. This is unexpected: data nodes start and register with the meta node one by one, so right after the system starts a node should receive only part of the node information and carry out its subsequent operations step by step. In this environment the first response was the full list, probably because the meta raft protocol had already persisted the registration information to local disk.
    image

  • Further code analysis shows the logic that moves a node from working back to init: when the current node is already in the working state and the incremental IPs of a later meta list change (the IPs newly added relative to the existing in-memory list) include the current node's own IP, the node is rolled back to init. This logic has not been modified since the earliest dataserver code. The original idea was presumably to let a node that had been disconnected for a long time restore its data once the link recovered. Under the later design, however, this rollback is inappropriate and causes many errors, such as bans, so this code will be removed in a follow-up. The code in question is as follows:

image

  • Further analysis of the data log: because of the raft-persisted content above, the node gets the historical list, and therefore the full list, sooner, so it enters the working state earlier than usual. If, after becoming working, it receives a meta push that does not contain the current node and then receives a push of the current full list, the resulting list increment contains the current node's IP, which finally triggers the change to init. For example:

    • The current data node IP is 41. It is already in the working state and receives the meta push list [49, 48]. This list change does not cause a state change, but the in-memory list becomes [49, 48].
    • Meta then pushes the list [41, 49, 48]. The increment of this push relative to the in-memory list is [41], and the node is currently working, so the condition in the code above is satisfied and the node changes to the init state.
  • The logs confirm this sequence:
    image
    image
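The [49, 48] then [41, 49, 48] sequence above can be replayed with a minimal sketch. All class, method, and field names here are hypothetical and only mirror the behaviour described in this issue; this is not the actual SOFARegistry code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the rollback rule described in this issue.
class DataNodeStateSketch {
    enum Status { INIT, WORKING }

    private final String localIp;
    private Status status;
    private Set<String> knownIps; // in-memory data IP list

    DataNodeStateSketch(String localIp, Status status, List<String> initialIps) {
        this.localIp = localIp;
        this.status = status;
        this.knownIps = new LinkedHashSet<>(initialIps);
    }

    /** Applies a list pushed by meta and returns the increment (IPs absent from the in-memory list). */
    List<String> onMetaPush(List<String> pushedIps) {
        List<String> added = new ArrayList<>();
        for (String ip : pushedIps) {
            if (!knownIps.contains(ip)) {
                added.add(ip);
            }
        }
        // The in-memory list is replaced by the pushed list in every case.
        knownIps = new LinkedHashSet<>(pushedIps);
        // The problematic rule: a working node that sees its own IP in the increment falls back to init.
        if (status == Status.WORKING && added.contains(localIp)) {
            status = Status.INIT;
        }
        return added;
    }

    Status status() {
        return status;
    }

    public static void main(String[] args) {
        DataNodeStateSketch node =
            new DataNodeStateSketch("41", Status.WORKING, Arrays.asList("41", "49", "48"));
        System.out.println(node.onMetaPush(Arrays.asList("49", "48")) + " " + node.status());
        System.out.println(node.onMetaPush(Arrays.asList("41", "49", "48")) + " " + node.status());
    }
}
```

The first push produces an empty increment and leaves the node working; the second produces an increment containing the node's own IP, which is the condition that triggers the rollback.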

  • One point in this sequence is unusual: while data 41 is already working, meta pushes a list [49, 48] that does not contain it. In theory, a node that has become working is already registered on meta, so every pushed list should contain 41. Further analysis of the meta leader's log shows that, after 41 became working, meta received no renew from it before its expiry time, so meta evicted 41 and pushed a list without it. The log is as follows:
    image

  • In other words, data 41 did not renew in time. The logs show that more than 30s passed between the completion of the first registration and the start of the renew task. This exceeds the configured expiry time for a registered node, so meta first evicts 41. When the renew is finally sent, meta pushes the full list [41, 49, 48].

  • The interval is this long because the first list change pushed by meta contains the data lists of other dataCenters. Currently the data node starts the renew task only after it has connected to every data node one by one. During this process, connections to other dataCenters timed out and were retried repeatedly, so the renew task could not start for more than 30s, which ultimately led to the result above.
    image
    image
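The meta-side effect of a late renew can be illustrated with a small lease table. This is a sketch under assumptions: the class and method names are hypothetical, and the 30s lease is an assumed value chosen to match the interval mentioned above, not a confirmed SOFARegistry setting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical lease table; times are plain millisecond counters to keep the example deterministic.
class MetaLeaseSketch {
    private final long leaseMillis;
    private final Map<String, Long> lastRenew = new LinkedHashMap<>();

    MetaLeaseSketch(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Registration and renew both refresh the node's last-seen time. */
    void registerOrRenew(String ip, long nowMillis) {
        lastRenew.put(ip, nowMillis);
    }

    /** Evicts nodes whose lease has expired and returns the list that would be pushed. */
    List<String> evictAndList(long nowMillis) {
        lastRenew.entrySet().removeIf(e -> nowMillis - e.getValue() > leaseMillis);
        return new ArrayList<>(lastRenew.keySet());
    }

    public static void main(String[] args) {
        MetaLeaseSketch meta = new MetaLeaseSketch(30_000); // assumed 30s lease
        meta.registerOrRenew("49", 0);
        meta.registerOrRenew("48", 0);
        meta.registerOrRenew("41", 0);
        // 49 and 48 renew on time; 41 is still busy connecting to other dataCenters.
        meta.registerOrRenew("49", 20_000);
        meta.registerOrRenew("48", 20_000);
        System.out.println(meta.evictAndList(35_000)); // 41 has been silent for more than 30s
        meta.registerOrRenew("41", 36_000);            // the late renew finally arrives
        System.out.println(meta.evictAndList(36_000));
    }
}
```

The first listing omits 41 and the second contains it again, which is the pair of pushes that the data node turns into an increment of [41].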

Reproducing the problem

  • Do not delete the meta raft persistence directory, so that data gets the full list on its first request and quickly enters working. Delay the start of the renew task by more than 30s so that meta evicts the node; the node then goes from working to init.
    image

Fix

  • In the dataserver initialization process, start the renew task for the current node immediately after node registration completes, so that the node stays valid on meta. This prevents meta from pushing node lists that omit the current node, which is what eventually moved the node from the working state back to the init state.
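The ordering proposed in the fix might look roughly like the sketch below. `MetaClient`, `start`, and `demo` are hypothetical names, and the 50ms renew interval and 300ms simulated connection delay are arbitrary illustration values; only `ScheduledExecutorService` and its methods are real JDK API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical startup sequence; not the SOFARegistry API.
class RenewFirstStartupSketch {
    /** Minimal stand-in for the calls a data node makes to meta. */
    interface MetaClient {
        void register(String ip);
        void renew(String ip);
    }

    /** Registers, starts renewing at once, and only then runs the slow connection step. */
    static void start(MetaClient meta, String localIp, ScheduledExecutorService scheduler,
                      long renewIntervalMillis, Runnable connectOtherNodes) {
        meta.register(localIp);
        // Initial delay 0: the first renew is submitted right after registration.
        scheduler.scheduleWithFixedDelay(
            () -> meta.renew(localIp), 0, renewIntervalMillis, TimeUnit.MILLISECONDS);
        // Timeouts and retries against other dataCenters no longer hold back the renew task.
        connectOtherNodes.run();
    }

    /** Runs the sequence with a simulated slow connection step and returns the recorded events. */
    static List<String> demo() {
        List<String> events = new CopyOnWriteArrayList<>();
        MetaClient meta = new MetaClient() {
            public void register(String ip) { events.add("register"); }
            public void renew(String ip) { events.add("renew"); }
        };
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            start(meta, "41", scheduler, 50, () -> {
                try {
                    Thread.sleep(300); // stands in for slow cross-dataCenter connection attempts
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                events.add("connected");
            });
        } finally {
            scheduler.shutdownNow();
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Running the renew task on its own scheduler thread is one way to decouple it from the connection step; the point of the sketch is only that renews begin before the slow step finishes.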

  • SOFARegistry version:

  • JVM version (e.g. java -version):

  • OS version (e.g. uname -a):

  • Maven version:

  • IDE version:

Labels: bug