From 47626e03a93661d192388f5d1257e4ac92e3ec7f Mon Sep 17 00:00:00 2001
From: qijun
Date: Tue, 7 Jan 2020 23:50:09 +0800
Subject: [PATCH 01/13] add draft hp ps design

---
 docs/designs/high_performance_ps.md | 46 +++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)
 create mode 100644 docs/designs/high_performance_ps.md

diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md
new file mode 100644
index 000000000..f4f95e0cb
--- /dev/null
+++ b/docs/designs/high_performance_ps.md
@@ -0,0 +1,46 @@
+# High Performance Parameter Server Design
+
+
+## Motivation
+
+This design doc focuses on implementing a high performance parameter server (PS for short). For the functionality of the PS, please refer to this [design doc](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/parameter_server.md).
+
+The PS receives gradients from workers, applies gradients to parameters, and sends the latest parameters back to workers. Receiving gradients and sending parameters bring IO workload to the PS, and applying gradients to parameters brings CPU workload. Both the IO workload and the CPU workload could be very high.
+
+The current PS is implemented in Python. Because of the `GIL` of Python, gradients are applied to parameters sequentially, on only one CPU core. As a result, the gradient-receiving service is also blocked, waiting for the current gradients to be consumed. To resolve this bottleneck, we have to make full use of the multiple CPU cores on the PS.
+
+Usually, the first thing that comes to mind is using C++ to re-implement such a high performance parameter server. But we have some concerns about the development efficiency of C++. Golang is another potential choice. In this doc, we will go through the key points of implementing a high performance parameter server to see whether Golang is competent for the job and can substitute for C++.
+
+## Communication
+
+The PS provides services to workers with the gRPC library. Both C++ and Go are well supported. The development efficiency of C++ could be somewhat lower than that of Go.
+
+## Computation
+
+The gradients and parameters on the PS are represented by tensors, and applying gradients to parameters, which is also called optimization, is actually an operation on tensors.
+
+### Tensor
+
+We have to support both dense tensors and sparse tensors. Besides, different element data types are also needed, such as int8/int32/float16/float32/float64. Int8 and float16 are used in training-based quantization. The tensor operators have to support these different data types.
+
+C++ supports generics with template programming, while Go does not support generics directly.
+
+### Math library
+
+There are different kinds of optimizers, which need some tensor operations. There are many mature math libraries developed in C++. For example, eigen is used in TensorFlow and Paddle, and aten is used in PyTorch. These math libraries provide abundant tensor operators and support both CPU and GPU. Besides, these math libraries can call some state-of-the-art BLAS libraries internally, such as MKL and cuBLAS. With these math libraries, the optimization operators could be implemented easily.
+
+*Go part TBD*
+
+This needs further survey. Generally, the math library ecosystem of Go is far from competitive with that of C++.
+
+## Scheduling
+
+In C++, we use thread-based scheduling. Threads are scheduled by the OS. Usually, we will implement one thread pool for computation and another thread pool for IO. The parameter optimization will be processed by the computation thread pool in parallel. Furthermore, to reduce the overhead of context switching, we could bind a thread to a certain CPU core by setting the thread's CPU affinity; this increases the cache hit rate of the CPU cores.
+
+In Go, there is no concept of a thread; we use goroutines instead. Goroutines are scheduled by the Go runtime, and they are not preemptive: a goroutine can only be switched out when an event such as IO, a function call, a channel operation, or `runtime.Gosched()` happens. There is a possibility that IO goroutines could not be scheduled for a while if all the CPU cores are occupied by computation goroutines. We will do some experiments to check this.

From 5900f65e4793c8a679153b95b01c75d6f5443102 Mon Sep 17 00:00:00 2001
From: qijun
Date: Tue, 7 Jan 2020 23:54:41 +0800
Subject: [PATCH 02/13] add draft hp ps design

---
 docs/designs/high_performance_ps.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md
index f4f95e0cb..6745a6756 100644
--- a/docs/designs/high_performance_ps.md
+++ b/docs/designs/high_performance_ps.md
@@ -5,20 +5,20 @@

This design doc focuses on implementing a high performance parameter server (PS for short). For the functionality of the PS, please refer to this [design doc](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/parameter_server.md).

The PS receives gradients from workers, applies gradients to parameters, and sends the latest parameters back to workers. Receiving gradients and sending parameters bring IO workload to the PS, and applying gradients to parameters brings CPU workload. Since one PS could receive gradients from many workers, both the IO workload and the CPU workload would be very high.

The current PS is implemented in Python. Because of the `GIL` of Python, gradients are applied to parameters sequentially, on only one CPU core. As a result, the gradient-receiving service is also blocked, waiting for the current gradients to be consumed. To resolve this bottleneck, we have to make full use of the multiple CPU cores of the PS.

Usually, the first thing that comes to mind is using C++ to re-implement such a high performance parameter server. But we have some concerns about the development efficiency of C++. Golang is another potential choice. In this doc, we will go through the key points of implementing a high performance parameter server to see whether Golang is competent for the job and could substitute for C++.

## Communication

The PS provides services to workers with the gRPC library. Both C++ and Go are well supported in gRPC. The development efficiency of C++ could be somewhat lower than that of Go.
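To make the service surface concrete, here is a minimal Go sketch of what the PS interface could look like. This is an illustration only: the type and method names (`Tensors`, `PushGradients`, `PullParameters`) are assumptions made for this sketch, not names taken from the design; in a real build the interface would be generated by `protoc` from a `.proto` file.

```go
package ps

import "context"

// Tensors is a placeholder for the serialized gradient or parameter payload.
type Tensors struct {
	Names  []string
	Values [][]float32
}

// PSServer sketches the two RPCs described in this doc: workers push
// gradients in, and pull the latest parameters back out.
type PSServer interface {
	PushGradients(ctx context.Context, grads *Tensors) (*Tensors, error)
	PullParameters(ctx context.Context, names []string) (*Tensors, error)
}
```

One property worth noting for the scheduling discussion later in this doc: gRPC-Go serves each incoming RPC on its own goroutine, so the IO side is concurrent out of the box.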
Both C++ and Go are well supported. The development efficiency of C++ could be some less than Go. +The PS provides services to workers with gRPC library. Both C++ and Go are well supported in gRPC. The development efficiency of C++ could be some less than Go. ## Computation -The gradients and parameters on PS are represented by tensors. And applying gradients to parameters, which is also called optimization, is acutally an operation to tensors. +The gradients and parameters on PS are represented by tensors. And applying gradients to parameters, which is also called optimization, is acutally an operation of tensors. ### Tensor From a8aad1548dbda1a7d287ba507177411a1daefa5a Mon Sep 17 00:00:00 2001 From: qijun Date: Wed, 8 Jan 2020 12:09:07 +0800 Subject: [PATCH 03/13] update --- docs/designs/high_performance_ps.md | 35 ++++++++++++++++++++--------- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md index 6745a6756..308435e25 100644 --- a/docs/designs/high_performance_ps.md +++ b/docs/designs/high_performance_ps.md @@ -1,6 +1,5 @@ # High Performance Parameter Server Design - ## Motivation This design doc focus on implementing a high performance parameter server(short for PS). For the functionality of the PS, please refer to this [design doc](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/parameter_server.md) @@ -9,8 +8,7 @@ PS receives gradients from workers, applies gradients to parameters, and sends t The current PS is implemented with Python. Because of `GIL` of Python, gradients are applied to parameters sequentially with only one CPU core. As a result, the receiving gradients service is also blocked, and waiting for current gradients to be consumed. To resolve this bottleneck, we have to fully use multi CPU cores capability of PS. -Usually, the first thing that comes to mind is using C++ to re-implement such a high performance parameter server. But we have some concerns on the development efficiency of C++. Golang is another potential choice. In this doc, we will go through the key points of implementing a high performance parameter server to see if Golang is competent for the job and could substitute C++. - +Usually, the first thing that comes to mind is using C++ to reimplement such a high performance parameter server. But we have some concerns on the development efficiency of C++. Golang is another potential choice. In this doc, we will go through the key points of implementing a high performance parameter server to see if Golang is competent for the job and could substitute C++. ## Communication @@ -22,25 +20,42 @@ The gradients and parameters on PS are represented by tensors. And applying grad ### Tensor -We have to support both dense tensor and sparse tensor. Besides, different element data types are also needed, such as int8/int32/float16/float32/float64. Int8 and float16 is used in training based quantization. The tensor operators have to support different data types. +We have to support both dense tensor and sparse tensor. Besides, different element data types are also needed, such as int8/int32/float16/float32/float64. Int8 and float16 are used in training based quantization. -C++ supports generic with template programming, while Go does not support generic directly. +Each tensor operator has to support different data types. C++ supports generics with template programming, while Go does not support generics directly. 
### Math library

There are different kinds of optimizers, which need some tensor operations. There are many mature math libraries developed in C++. For example, [eigen](https://gitlab.com/libeigen/eigen) is used in TensorFlow and Paddle, and [aten](https://github.com/pytorch/pytorch/tree/master/aten) is used in PyTorch. These math libraries provide abundant tensor operators and support both CPU and GPU. Besides, these math libraries can call some state-of-the-art BLAS libraries internally, such as MKL and cuBLAS. With these math libraries, the optimization operators could be implemented easily and efficiently.

It seems that there are few math libraries in Go. [gosl](https://github.com/cpmech/gosl) is not active, and [gonum](https://github.com/gonum/gonum) does not support MKL. Generally, the math library ecosystem of Go is far from competitive with that of C++. And we also have some worry about the performance of math libraries in Go.

## Scheduling

In C++, we use thread-based scheduling. Threads are scheduled by the OS. Usually, we will implement one thread pool for computation and another thread pool for IO. The parameter optimization will be processed by the computation thread pool in parallel. Furthermore, to reduce the overhead of context switching, we could bind a thread to a certain CPU core by setting the thread's CPU affinity; this increases the cache hit rate of the CPU cores.

In Go, there is no concept of a thread; we use goroutines instead. Goroutines are scheduled by the Go runtime, and they are not preemptive. There are four classes of events in Go programs that allow the scheduler to make scheduling decisions. This does not mean scheduling will always happen on one of these events; it means the scheduler gets the opportunity:

- The use of the keyword `go`
- Garbage collection
- System calls
- Synchronization and orchestration

The Go scheduler requires well-defined user-space events that occur at safe points in the code to context-switch from. These events and safe points manifest themselves within function calls. If any tight loops are running without making function calls, they will cause latencies within the scheduler and garbage collection. It is critically important that function calls happen within reasonable timeframes.
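As a toy illustration of this point (describing behavior we assume for Go versions up to 1.13), the first loop below contains no function call and therefore offers the scheduler no safe point, while the second gains one through a non-inlinable call:

```go
package main

//go:noinline
func step(s, i int) int { return s + i }

// tightLoop has no function call in its body, so on Go <= 1.13 the scheduler
// cannot preempt the goroutine anywhere inside the loop.
func tightLoop(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s += i
	}
	return s
}

// preemptibleLoop makes a real (non-inlined) call on every iteration; each
// call is a safe point where the scheduler may context-switch.
func preemptibleLoop(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s = step(s, i)
	}
	return s
}

func main() {
	_ = tightLoop(1000000000)
	_ = preemptibleLoop(1000000000)
}
```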
There are also discussions in the Go community:

- [issue 10958, runtime: tight loops should be preemptible](https://github.com/golang/go/issues/10958)
- [issue 36365, runtime: clean up async preemption loose ends](https://github.com/golang/go/issues/36365)

It seems that the problem is addressed in Go 1.14 and Go 1.15, but the stable versions are not released yet.

The optimization in deep learning is actually a tight loop: a gradient tensor with 10000 elements has to be applied to a parameter tensor with 10000 elements, and optimization usually involves a lot of element-wise operations. There is a possibility that IO goroutines could not be scheduled for a while if all the CPU cores are occupied by computation goroutines. We will do some experiments to check this.

## Reference

- https://gitlab.com/libeigen/eigen
- https://github.com/cpmech/gosl
- https://github.com/gonum/gonum
- https://www.ardanlabs.com/blog/2018/08/scheduling-in-go-part2.html
\ No newline at end of file

From b666ecf1bcbf50116a90c3d46577eb5e20cba9ee Mon Sep 17 00:00:00 2001
From: qijun
Date: Wed, 8 Jan 2020 12:10:39 +0800
Subject: [PATCH 04/13] format

---
 docs/designs/high_performance_ps.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md
index 308435e25..cde889598 100644
--- a/docs/designs/high_performance_ps.md
+++ b/docs/designs/high_performance_ps.md
@@ -44,14 +44,16 @@ In Go, there is no concept of thread, we use goroutine instead. Goroutines are s

The Go scheduler requires well-defined user-space events that occur at safe points in the code to context-switch from. These events and safe points manifest themselves within function calls. It is critically important that function calls happen within reasonable timeframes.

There are also some discussions in the Go community:

- [issue 10958, runtime: tight loops should be preemptible](https://github.com/golang/go/issues/10958)
- [issue 36365, runtime: clean up async preemption loose ends](https://github.com/golang/go/issues/36365)

It seems that this problem is addressed in Go 1.14 and Go 1.15, but the stable versions are not released yet.

The optimization in deep learning is actually a tight loop. For example, a gradient tensor with 10000 elements has to be applied to a parameter tensor with 10000 elements. Optimization usually involves a lot of element-wise operations.

There is a possibility that IO goroutines could not be scheduled for a while if all the CPU cores are occupied by computation goroutines. We will do some experiments to check this.
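One such experiment could look like the sketch below (our own construction, not code from this repo): fill every core with tight computation loops, then measure how long a trivial "IO-like" goroutine waits before it runs. On Go 1.13 and earlier, the measured delay can approach the full runtime of the spinning loops; with asynchronous preemption it should be small.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// spin burns CPU in a tight loop with no function calls, so on Go <= 1.13 it
// cannot be preempted once it is running.
func spin(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s += i
	}
	return s
}

func main() {
	// Occupy every logical CPU with a computation goroutine.
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		go spin(2000000000)
	}

	start := time.Now()
	delay := make(chan time.Duration, 1)
	go func() { delay <- time.Since(start) }() // stands in for an IO goroutine

	fmt.Println("IO-like goroutine was delayed by:", <-delay)
}
```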
## Reference

From a09b9a4739c7ba7f121a08337366734395f65199 Mon Sep 17 00:00:00 2001
From: qijun
Date: Wed, 8 Jan 2020 12:14:02 +0800
Subject: [PATCH 05/13] refine doc

---
 docs/designs/high_performance_ps.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md
index cde889598..7c073cd8b 100644
--- a/docs/designs/high_performance_ps.md
+++ b/docs/designs/high_performance_ps.md
@@ -49,7 +49,7 @@ There are also some discussions on Go community:

It seems that this problem is addressed in Go 1.14, and there are still some issues left for Go 1.15. But the stable version of Go 1.14 is not released yet.

From b8e9eb958422b9594dad30026b2b8ed9f0c81cc2 Mon Sep 17 00:00:00 2001
From: qijun
Date: Wed, 8 Jan 2020 13:40:14 +0800
Subject: [PATCH 06/13] fix typos

---
 docs/designs/high_performance_ps.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md
index 7c073cd8b..495bb0d4b 100644
--- a/docs/designs/high_performance_ps.md
+++ b/docs/designs/high_performance_ps.md
@@ -4,11 +4,11 @@

This design doc focuses on implementing a high performance parameter server (PS for short). For the functionality of the PS, please refer to this [design doc](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/parameter_server.md).

The PS receives gradients from workers, applies gradients to parameters, and sends the latest parameters back to workers. Receiving gradients and sending parameters bring IO workload to the PS, and applying gradients to parameters brings CPU workload. Since one PS could receive gradients from many workers, both the IO workload and the CPU workload would be very heavy.

The current PS is implemented in Python. Because of the `GIL` of Python, gradients are applied to parameters sequentially, on only one CPU core. As a result, the gradient-receiving service is also blocked, waiting for the current gradients to be consumed. To resolve this bottleneck, we have to make full use of the multiple CPU cores of the PS.

Usually, the first thing that comes to mind is using C++ to reimplement a high performance parameter server. But we have some concerns about the development efficiency of C++.
Go is another potential choice. In this doc, we will go through the key points of implementing a high performance parameter server to see whether Go is competent for the job and could substitute C++.

## Communication

The PS provides services to workers with the gRPC library. Both C++ and Go are well supported in gRPC. The development efficiency of C++ could be somewhat lower than that of Go.

## Computation

The gradients and parameters on the PS are represented by tensors, and applying gradients to parameters, which is also called optimization, is actually a math operation on tensors.

### Tensor

We have to support both dense tensors and sparse tensors. Besides, different element data types are also needed, such as int8/int32/float16/float32/float64. Int8 and float16 are used in training-based quantization.

Each tensor operator has to support different data types. C++ supports generics with template programming, while Go does not support generics directly.

### Math library

@@ -26,14 +26,14 @@

There are different kinds of optimizers, which need some tensor operations. There are many mature math libraries developed in C++. For example, [eigen](https://gitlab.com/libeigen/eigen) is used in TensorFlow and Paddle, and [aten](https://github.com/pytorch/pytorch/tree/master/aten) is used in PyTorch. These math libraries provide abundant tensor operators and support both CPU and GPU. Besides, these math libraries can call some state-of-the-art BLAS libraries internally, such as MKL and cuBLAS. With these math libraries, the operators in optimizers could be implemented easily and efficiently.

It seems that there are few math libraries in Go. [gosl](https://github.com/cpmech/gosl) is no longer active, and [gonum](https://github.com/gonum/gonum) does not support MKL. Generally, the math library ecosystem of Go is far from competitive with that of C++. And we also have some worry about the performance of math libraries in Go.

## Scheduling

In C++, we use thread-based scheduling. Threads are scheduled by the operating system. Usually, we will implement one thread pool for computation and another thread pool for IO. The parameter optimization will be processed by the computation thread pool in parallel. Furthermore, to reduce the overhead of context switching, we could bind a thread to a certain CPU core by setting CPU affinity on the thread. It will increase the cache hit rate of a CPU core.
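For reference, the same pinning idea can also be expressed from Go on Linux, although it pins the OS thread backing the current goroutine rather than a pool worker. This is a hedged sketch built on `runtime.LockOSThread` and the `golang.org/x/sys/unix` package; we have not benchmarked it.

```go
package main

import (
	"runtime"

	"golang.org/x/sys/unix"
)

// pinToCPU wires the calling goroutine to one OS thread, then restricts that
// thread's affinity mask to a single CPU core (Linux only).
func pinToCPU(cpu int) error {
	runtime.LockOSThread() // keep this goroutine on its current OS thread

	var set unix.CPUSet
	set.Zero()
	set.Set(cpu)
	// pid 0 means "the calling thread".
	return unix.SchedSetaffinity(0, &set)
}

func main() {
	if err := pinToCPU(0); err != nil {
		panic(err)
	}
	// ... computation that benefits from a warm per-core cache ...
}
```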
In Go, there is no concept of a thread; we use goroutines instead. Goroutines are scheduled by the Go runtime, and they are not preemptive. There are four classes of events in Go programs that allow the scheduler to make scheduling decisions. This does not mean scheduling will always happen on one of these events; it means the scheduler gets the opportunity.

@@ -49,9 +49,9 @@ There are also some discussions on Go community:

It seems that this problem is addressed partially in Go 1.14, and there are still some issues left for Go 1.15. But the stable version of Go 1.14 is not released yet.

The optimization in deep learning is actually a tight loop. For example, a gradient tensor with 10000 elements has to be applied to a parameter tensor with 10000 elements. Optimization usually involves a lot of element-wise tensor operations.

From 60a57747e62b6c8620355151967f371675ca2fdc Mon Sep 17 00:00:00 2001
From: qijun
Date: Wed, 8 Jan 2020 13:47:04 +0800
Subject: [PATCH 07/13] update links

---
 docs/designs/high_performance_ps.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md
index 495bb0d4b..ba0c3b471 100644
--- a/docs/designs/high_performance_ps.md
+++ b/docs/designs/high_performance_ps.md
@@ -6,7 +6,7 @@

The current PS is implemented in Python. Because of the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) of Python, gradients are applied to parameters sequentially, on only one CPU core.
As a result, the gradient-receiving service is also blocked, waiting for the current gradients to be consumed. To resolve this bottleneck, we have to make full use of the multiple CPU cores of the PS.

Usually, the first thing that comes to mind is using C++ to reimplement a high performance parameter server. But we have some concerns about the development efficiency of C++. Go is another potential choice. In this doc, we will go through the key points of implementing a high performance parameter server to see whether Go is competent for the job and could substitute C++.

From 04b8d07c7484f0d705ff57b96248d72a70853d5a Mon Sep 17 00:00:00 2001
From: qijun
Date: Fri, 10 Jan 2020 10:18:12 +0800
Subject: [PATCH 08/13] update

---
 docs/designs/high_performance_ps.md | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md
index ba0c3b471..82774e13d 100644
--- a/docs/designs/high_performance_ps.md
+++ b/docs/designs/high_performance_ps.md
@@ -2,17 +2,17 @@

## Motivation

This design doc focuses on implementing a high performance parameter server (PS for short). For the functionality of the PS, please refer to this [design doc](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/parameter_server.md).

The PS receives gradients from workers, applies gradients to parameters, and sends the latest parameters back to workers. Receiving gradients and sending parameters bring IO workload to the PS, and applying gradients to parameters brings CPU workload. Since one PS could receive gradients from many workers, both the IO workload and the CPU workload would be very heavy.

The current PS is implemented in Python. Because of the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) of Python, gradients are applied to parameters sequentially, on only one CPU core. As a result, the gradient-receiving service is also blocked, waiting for the current gradients to be consumed. To resolve this bottleneck, we have to make full use of the multiple CPU cores of the PS.

Usually, the first thing that comes to mind is using C++ to reimplement a high performance parameter server. But we have some concerns about the development efficiency of C++. Go is another potential choice. In this doc, we will go through the key points of implementing a high performance parameter server to see whether Go is competent for the job and could substitute C++ in all or in part.

## Communication

The PS provides services to workers with the gRPC library. Both C++ and Go are well supported in gRPC. Go has better development efficiency than C++.

## Computation

@@ -22,9 +20,9 @@ The gradients and parameters on PS are represented by tensors. And applying grad

It seems that there are few math libraries in Go. [gosl](https://github.com/cpmech/gosl) is no longer active, and [gonum](https://github.com/gonum/gonum) does not support MKL. Generally, the math library ecosystem of Go is far from competitive with that of C++. And we also have some worry about the performance of math libraries in Go.
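For scale, a plain-Go optimizer step is still easy to write; what is uncertain is its performance against MKL-backed C++ code. Below is a minimal SGD update using gonum's `floats` helpers (pure Go; `AddScaled` computes `dst += alpha * s`):

```go
package main

import (
	"fmt"

	"gonum.org/v1/gonum/floats"
)

func main() {
	params := []float64{0.5, -1.2, 3.0} // parameter tensor (flattened)
	grads := []float64{0.1, -0.2, 0.4}  // gradient tensor
	lr := 0.01

	// SGD step: params = params - lr * grads, as one vectorized call.
	floats.AddScaled(params, -lr, grads)

	fmt.Println(params) // approximately [0.499 -1.198 2.996]
}
```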
[Cgo](https://golang.org/cmd/cgo/) enables the creation of Go packages that call C code.

## Scheduling

In C++, we use thread-based scheduling. Threads are scheduled by the operating system. Usually, we will implement one thread pool for computation and another thread pool for IO. The parameter optimization will be processed by the computation thread pool in parallel. Furthermore, to reduce the overhead of context switching, we could bind a thread to a certain CPU core by setting CPU affinity on the thread. It will increase the cache hit rate of a CPU core.

In Go, there is no concept of a thread; we use goroutines instead. Goroutines are scheduled by the Go runtime, and they are not preemptive. There are four classes of events in Go programs that allow the scheduler to make scheduling decisions. This does not mean scheduling will always happen on one of these events; it means the scheduler gets the opportunity.

- The use of the keyword `go`
- Garbage collection
- System calls
- Synchronization and orchestration

Go supports concurrent programming well with first-class concepts: the goroutine and the channel.

## Conclusion

Considering the tradeoff between development efficiency and program performance, we plan to put the communication and scheduling parts in Go, and the computation part in C++.

The optimization operators will be implemented in C++, wrapped with a C interface, and exposed to Go. The receiving-gradients and sending-parameters services are implemented in Go. Once gradients are received from a worker, a goroutine will be launched to do the optimization.

## Reference

From ef0aefa8e784dc56358050e8c7b4781ec3092423 Mon Sep 17 00:00:00 2001
From: qijun
Date: Fri, 10 Jan 2020 10:22:21 +0800
Subject: [PATCH 09/13] update

---
 docs/designs/high_performance_ps.md | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md
index 82774e13d..52d9bc71c 100644
--- a/docs/designs/high_performance_ps.md
+++ b/docs/designs/high_performance_ps.md
@@ -30,9 +30,6 @@ There are different kinds of optimizers, which need some tensor operations. Ther

It seems that there are few math libraries in Go. [gosl](https://github.com/cpmech/gosl) is no longer active, and [gonum](https://github.com/gonum/gonum) does not support MKL. Generally, the math library ecosystem of Go is far from competitive with that of C++.

## Scheduling

In C++, we use thread-based scheduling. Threads are scheduled by the operating system. Usually, we will implement one thread pool for computation and another thread pool for IO.
The parameter optimization will be processed by the computation thread pool in parallel. Furthermore, to reduce the overhead of context switching, we could bind a thread to a certain CPU core by setting CPU affinity on the thread. It will increase the cache hit rate of a CPU core.

@@ -48,10 +45,11 @@ Go supports concurrent programming well with first-class concepts, goroutine and

## Conclusion

Considering the tradeoff between development efficiency and program performance, we plan to put the communication and scheduling parts in Go, and the computation part in C++.

[Cgo](https://golang.org/cmd/cgo/) enables the creation of Go packages that call C code, and the overhead of cgo is slight. The optimization operators will be implemented in C++, wrapped with a C interface, and exposed to Go.

The receiving-gradients and sending-parameters services are implemented in Go. Once gradients are received from a worker, a goroutine will be launched to do the optimization.

From eaef642094b727279d4de512a1e479dee0cc487e Mon Sep 17 00:00:00 2001
From: qijun
Date: Fri, 10 Jan 2020 10:25:40 +0800
Subject: [PATCH 10/13] fix typo

---
 docs/designs/high_performance_ps.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md
index 52d9bc71c..ba517a467 100644
--- a/docs/designs/high_performance_ps.md
+++ b/docs/designs/high_performance_ps.md
@@ -49,7 +49,7 @@

The receiving-gradients and sending-parameters services are implemented in Go. Once gradients are received from a worker, a goroutine will be launched to do the optimization.

From 3e470e5c0b363cb8a267014a59b912b00785926d Mon Sep 17 00:00:00 2001
From: qijun
Date: Fri, 10 Jan 2020 13:59:33 +0800
Subject: [PATCH 11/13] polish doc

---
 docs/designs/high_performance_ps.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md
index ba517a467..37e981535 100644
--- a/docs/designs/high_performance_ps.md
+++ b/docs/designs/high_performance_ps.md
@@ -2,11 +2,11 @@

## Motivation

This design doc focuses on implementing a high performance parameter server (PS). For the functionality of the PS, please refer to this [design doc](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/parameter_server.md).
The PS receives gradients from workers, applies gradients to parameters, and sends the latest parameters to workers. Receiving gradients and sending parameters are the primary I/O workloads of the PS, and updating parameters costs CPU resources. Since one PS could receive gradients from more than one worker, both the I/O workload and the CPU workload could be heavy.

The current PS is in Python. Due to the existence of the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) in Python, gradients are applied to parameters sequentially, on only one CPU core. As a result, the gradient-receiving service is also blocked, waiting for the current gradients to be consumed. We want to remove this bottleneck and fully utilize multiple CPU cores.

Usually, the first thing that comes to mind is using C++ to reimplement a high performance parameter server. But we have some concerns about the development efficiency of C++. Go is another potential choice. In this doc, we will go through the key points of implementing a high performance parameter server to see whether Go is competent for the job and could substitute C++ in all or in part.

@@ -28,7 +28,7 @@ There are different kinds of optimizers, which need some tensor operations. Ther

It seems that there are few math libraries in Go. [Gosl](https://github.com/cpmech/gosl) is no longer active, and [gonum](https://github.com/gonum/gonum) does not support MKL. Generally, the math library ecosystem of Go is far from competitive with that of C++. And we also have some worry about the performance of math libraries in Go.

## Scheduling

@@ -45,7 +45,7 @@ Go supports concurrent programming well with first-class concepts, goroutine and

Considering the tradeoff between development efficiency and program performance, we plan to put the communication and scheduling parts in Go, and the computation part in C++.

[Cgo](https://golang.org/cmd/cgo/) enables the creation of Go packages that call C code, and the overhead of cgo is slight. The optimization operators will be implemented in C++, wrapped with a C interface, and exposed to Go, as sketched below.

The receiving-gradients and sending-parameters services are implemented in Go. Once gradients are received from a worker, a goroutine will be launched to do the optimization.
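The following sketch shows the shape of that split. The kernel name `elasticdl_sgd`, its signature, and the linker flag are assumptions made for illustration; the real kernel would live on the C++ side behind an `extern "C"` declaration.

```go
package optimizer

/*
#cgo LDFLAGS: -lpskernels
// Hypothetical C entry point exported by the C++ optimization library.
void elasticdl_sgd(float* param, const float* grad, float lr, int n);
*/
import "C"
import "unsafe"

// SGD applies param -= lr * grad by calling into the C++ kernel through cgo.
func SGD(param, grad []float32, lr float32) {
	C.elasticdl_sgd(
		(*C.float)(unsafe.Pointer(&param[0])),
		(*C.float)(unsafe.Pointer(&grad[0])),
		C.float(lr),
		C.int(len(param)),
	)
}

// HandlePush is roughly how the Go service could hand work off: the gRPC
// handler already runs on its own goroutine, so it can apply gradients
// directly, or spawn another goroutine to acknowledge the worker first.
func HandlePush(param, grad []float32, lr float32, done chan<- struct{}) {
	go func() {
		SGD(param, grad, lr)
		done <- struct{}{}
	}()
}
```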
+Considering the tradeoff between development efficiency and program peformance, we plan to put communication and scheduling parts in Go, and computation part in C++. [Cgo](https://golang.org/cmd/cgo/) enables the creation of Go packages that call C code. And the overhead of cgo is slight. The optimization operators will be implemented in C++, wrappered with C interface, and exposed to Go. From 5efb0fcd98f925f27b4d0a415b9fe2693857bdd3 Mon Sep 17 00:00:00 2001 From: qijun Date: Sun, 12 Jan 2020 14:27:59 +0800 Subject: [PATCH 12/13] fix typo --- docs/designs/high_performance_ps.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md index 37e981535..cf730de8c 100644 --- a/docs/designs/high_performance_ps.md +++ b/docs/designs/high_performance_ps.md @@ -16,7 +16,7 @@ The PS provides services to workers with gRPC library. Both C++ and Go are well ## Computation -The gradients and parameters on PS are represented by tensors. And applying gradients to parameters, which is also called optimization, is acutally a math operation of tensors. +The gradients and parameters on PS are represented by tensors. And applying gradients to parameters, which is also called optimization, is actually a math operation of tensors. ### Tensor @@ -28,13 +28,13 @@ Each tensor operator has to support different data types. C++ supports generics There are different kinds of optimizers, which need some tensor operations. There are many mature math libraries developed with C++. For example, [eigen](https://gitlab.com/libeigen/eigen) is used in TensorFlow and Paddle, [aten](https://github.com/pytorch/pytorch/tree/master/aten) is used in Pytorch. These math libraries provide abundant tensor operators and support both CPU and GPU. Besides, these math libraries could call some state-of-the-art blas libraries internally, such as MKL and cuBLAS. With these math libraries, the operators in optimizers could be implemented easily and efficiently. -It seems that there are few math libraries in Go. [Gosl](https://github.com/cpmech/gosl) is no longer active, and [gonum](https://github.com/gonum/gonum) does not support MKL. Generally, the math library ecology of Go is far from competing to C++. And we also have some faint worry with the performance of math libraries in Go. +It seems that there are few math libraries in Go. [Gosl](https://github.com/cpmech/gosl) is no longer active, and [gonum](https://github.com/gonum/gonum) does not support MKL. Generally, the math library ecology of Go is far from competing to C++. And we also have some worry with the performance of math libraries in Go. ## Scheduling -In C++, we use thread based scheduling. Threads are scheduled by the operating system. Usually, we will implement a thread pool for computation, and another thread pool for IO. The parameter optimzation will be processed by the computation thread pool in parallel. In further, to reduce the overhead of context switching, we could bind a thread to a certain CPU core by setting CPU affinity to the thread. It will increase the cache hit rate of a CPU core. +In C++, we use thread based scheduling. Threads are scheduled by the operating system. Usually, we will implement a thread pool for computation, and another thread pool for IO. The parameter optimization will be processed by the computation thread pool in parallel. 
Furthermore, to reduce the overhead of context switching, we could bind a thread to a certain CPU core by setting CPU affinity on the thread. It will increase the cache hit rate of a CPU core.

In Go, there is no concept of a thread; we use goroutines instead. Goroutines are scheduled by the Go runtime, and they are not preemptive. There are four classes of events in Go programs that allow the scheduler to make scheduling decisions. This does not mean scheduling will always happen in one of these events; it means the scheduler gets the opportunity.

@@ -45,11 +45,11 @@ Go supports concurrent programming well with first-class concepts, goroutine and

## Conclusion

Considering the tradeoff between development efficiency and program performance, we plan to put the communication and scheduling parts in Go, and the computation part in C++.

[Cgo](https://golang.org/cmd/cgo/) enables the creation of Go packages that call C code, and the overhead of cgo is slight. The optimization operators will be implemented in C++, wrapped with a C interface, and exposed to Go.

The receiving-gradients and sending-parameters services are implemented in Go. Once receiving gradients from a worker, a goroutine will be launched to do the optimization.

From 3e9457915cc129b101f7570db933c48de3131835 Mon Sep 17 00:00:00 2001
From: qijun
Date: Mon, 13 Jan 2020 17:44:48 +0800
Subject: [PATCH 13/13] polish doc

---
 docs/designs/high_performance_ps.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/designs/high_performance_ps.md b/docs/designs/high_performance_ps.md
index cf730de8c..1969919c6 100644
--- a/docs/designs/high_performance_ps.md
+++ b/docs/designs/high_performance_ps.md
@@ -4,7 +4,7 @@

This design doc focuses on implementing a high performance parameter server (PS). For the functionality of the PS, please refer to this [design doc](https://github.com/sql-machine-learning/elasticdl/blob/develop/docs/designs/parameter_server.md).

The PS receives gradients from workers, applies gradients to parameters, and sends the latest parameters to workers. Receiving gradients and sending parameters are the primary I/O workloads of the PS, and updating parameters costs CPU resources.
Since one PS could receive gradients from more than one worker, both the I/O workload and the CPU workload could be heavy.

The current PS is in Python. Due to the existence of the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) in Python, gradients are applied to parameters sequentially, on only one CPU core. As a result, the gradient-receiving service is also blocked, waiting for the current gradients to be consumed. We want to remove this bottleneck and fully utilize multiple CPU cores.

@@ -34,7 +34,7 @@ It seems that there are few math libraries in Go. [Gosl](https://github.com/cpme

In C++, we use thread-based scheduling. Threads are scheduled by the operating system. Usually, we will implement one thread pool for computation and another thread pool for IO. The parameter optimization will be processed by the computation thread pool in parallel. Furthermore, to reduce the overhead of context switching, we could bind a thread to a certain CPU core by setting CPU affinity on the thread. It will increase the cache hit rate of a CPU core.

In Go, there is no concept of a thread; we use goroutines instead. Goroutines are scheduled by the Go runtime, and they are not preemptive. There are four classes of events that occur in Go programs and allow the scheduler to make scheduling decisions. This does not mean scheduling will always happen in one of these events; it means the scheduler gets the opportunity.

- The use of the keyword `go`
- Garbage collection
- System calls
- Synchronization and orchestration

@@ -49,7 +49,7 @@ Considering the tradeoff between development efficiency and program performance,

[Cgo](https://golang.org/cmd/cgo/) enables the creation of Go packages that call C code, and the overhead of cgo is slight. The optimization operators will be implemented in C++, wrapped with a C interface, and exposed to Go.

The receiving-gradients and sending-parameters services are implemented in Go. Once receiving gradients from a worker, a goroutine will be launched to do the optimization.

## Reference