Skip to content
This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Add container info to framework #13

Closed
wants to merge 1 commit into from
Closed

Conversation

yyrdl
Copy link

@yyrdl yyrdl commented May 16, 2019

No description provided.

@msftclas
Copy link

msftclas commented May 16, 2019

CLA assistant check
All CLA requirements met.

@yqwang-ms
Copy link
Member

Thanks for the PR. :)

So, why do you need to add ContainerStatuses into TaskAttemptStatus?
They are already in PodStatus, and we can use Namespace, PodName and PodUID to locate it.

@yyrdl
Copy link
Author

yyrdl commented May 21, 2019

Thanks for the PR. :)

So, why do you need to add ContainerStatuses into TaskAttemptStatus?
They are already in PodStatus, and we can use Namespace, PodName and PodUID to locate it.

Thanks for reply ^ _^ 

yeah , I can locate it by pod when the framework is still running . But after the framework is completed , the pod will be deleted by the frameworkcontroller, then we will lose the container information .

@yqwang-ms
Copy link
Member

Yep, this is by design as we have no other choice to stop a running Pod except for deleting it.
But, we only store necessary info about the pod into TaskAttemptStatus, such as pod uid, etc, to store more history info, we plan to store them into a history store, such as a DB, instead of into TaskAttemptStatus which will be eventually stored in ETCD, which is not designed to store too much history data.

@yyrdl
Copy link
Author

yyrdl commented May 21, 2019

Yep, this is by design as we have no other choice to stop a running Pod except for deleting it.
But, we only store necessary info about the pod into TaskAttemptStatus, such as pod uid, etc, to store more history info, we plan to store them into a history store, such as a DB, instead of into TaskAttemptStatus which will be eventually stored in ETCD, which is not designed to store too much history data.

You are right , it is not a good choice to store too many things in ETCD , we store the whole information (config , status , resource usage etc..) of a framework in mysql . When the framework is completed , we will also delete the framework from k8s in rest-server.

In my company , we need to supply the container log to user, then container information is necessary. If we can get the container info from TaskAttemptStatus directly , it will be more convenient. Otherwise , we need to sync the info when it is changed ( created , recreated ...), which is complex .

Besides, we also need "container id " and "node ip" to supply "docker commit" feature.

This PR is just a suggestion :)

@yqwang-ms
Copy link
Member

yqwang-ms commented May 21, 2019

First thanks to try out the frameworkcontroller :)
We (https://github.com/Microsoft/pai) are also starting to migrate from YARN to frameworkcontroller to provide a pure k8s PAI in next few months. We also face the same Framework Status History and Container Log problem.
Even though we have not decided the final design, but you can try below options:

  1. Framework Status History
    If Framework Status changed (such as Task retried with a new Pod), push it (and related K8S objects such as Pod Status) into another store, like DB or just local file system.
    You can also have a history web server to serve these history status.

  2. Container Log
    Each container can write its log to a distributed FS, like HDFS or NFS, with path like:
    /logs/$FC_FRAMEWORK_NAME/$FC_FRAMEWORK_ATTEMPT_INSTANCE_UID/$FC_TASKROLE_NAME/$FC_TASK_INDEX/$FC_TASK_ATTEMPT_INSTANCE_UID/***
    Or
    /logs/$FC_FRAMEWORK_NAME/$FC_FRAMEWORK_ATTEMPT_INSTANCE_UID/$FC_TASKROLE_NAME/$FC_TASK_INDEX/$FC_TASK_ATTEMPT_INSTANCE_UID/{NODE_IP}/{CONTAINER_NAME}/{CONTAINER_ID}/***
    You can also have a log web server to serve these logs.

It seems that 2. can already solve this

In my company , we need to supply the container log to user, then container information is necessary. If we can get the container info from TaskAttemptStatus directly , it will be more convenient. Otherwise , we need to sync the info when it is changed ( created , recreated ...), which is complex .

Anyway to solve all things, you may need both 1 and 2, and combine them together.

For this PR, I still have concerns to store the orignal ContainerStatuses into Framework Status, because it does not totally solve the problem 1 or 2, and the real ground truth ContainerStatuses is in Pod Status, and Framework Status is by design does not store these duplicated info as much as possible.

So, I think we can first hold this PR until we have a final design, is that OK?

@yyrdl
Copy link
Author

yyrdl commented May 21, 2019

yep , it's just a suggestion , we will be happy to see your final design :)

@yqwang-ms
Copy link
Member

Thanks for the understanding, will notify you for the final design.
Currently, you can continue use your own apporach in the PR, I will close it now, and reopen it later on demand.

@yqwang-ms yqwang-ms closed this May 21, 2019
@yqwang-ms
Copy link
Member

yqwang-ms commented Jul 31, 2019

Hi @yyrdl , please check our final design:

  1. PR Support LogObjectSnapshot: Expose Framework and Pod History #31 will let FC to expose pod (and framework) history.
  2. We plan to use EFK to collect the pod (and framework) history in https://github.com/Microsoft/pai.
    You can also do this by yourself.

Any comments? :)

@yyrdl
Copy link
Author

yyrdl commented Jul 31, 2019

Hi @yyrdl , please check our final design:

  1. PR Support LogObjectSnapshot: Expose Framework and Pod History #31 will let FC to expose pod (and framework) history.
  2. We plan to use EFK to collect the pod (and framework) history in https://github.com/Microsoft/pai.
    You can also do this by yourself.

Any comments? :)

哈哈,谢谢提醒,这样确实解决了问题,挺好的 👍

@yqwang-ms
Copy link
Member

🤝

@yqwang-ms
Copy link
Member

yqwang-ms commented Oct 30, 2019

@yyrdl pls refer example for job/pod history in:
microsoft/pai#3626
microsoft/pai#3623

@yyrdl
Copy link
Author

yyrdl commented Nov 11, 2019

@yyrdl pls refer example for job/pod history in:
microsoft/pai#3626
microsoft/pai#3623

:) thanks

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants