Skip to content

Rewrite pairwise to remove concatenation from blockwise #447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Feb 8, 2021

Conversation

aktech
Copy link
Contributor

@aktech aktech commented Jan 28, 2021

This rewrites pairwise distance to fix the concatenation in blockwise. The implementation is motivated from dask's matmul PR.

Fixes #375

I ran this on MalariaGEN data, it took nearly the same time as blocks implementation on 50 Workers, 16 GB each. The time comparions can be seen in the notebook here

The dask reports can be seen here:

Blocks implementation:

Blockwise implementation:

Important note: These times are only rough estimates for comparison, not for quoting. When I ran the blocks implementation on the same configuration of coiled cloud couple of months ago it took ~ 10 min, now its taking more, since workers are dying due to: OSError: Timed out during handshake while connecting to tls:.... I am not sure why is that happening since no code change has bee made in the blocks implementation. I am communicating with coiled team to find the root cause at the moment.

cc @ravwojdyla

Update

A quick performance for correlation on only 10% of malariaGEN data:

TODO:

  • Remove notebooks and performance reports before merging.

@aktech aktech force-pushed the pairwise-via-blockwise branch from c835e07 to 9f3c549 Compare January 28, 2021 14:52
Copy link
Collaborator

@ravwojdyla ravwojdyla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @aktech! Scanned this PR, looks good, some small comments below, but nothing critical. As discussed in the chat, it would be nice to see the correlation metric performance report between blocks and blockwise.

@aktech aktech force-pushed the pairwise-via-blockwise branch 2 times, most recently from 004205e to 289d550 Compare February 2, 2021 23:14
@aktech aktech marked this pull request as ready for review February 2, 2021 23:49
@aktech
Copy link
Contributor Author

aktech commented Feb 2, 2021

@ravwojdyla I think I have addressed all the comments. I have added a quick performance test for correlation metric as well (See PR description above). I was initially seeing huge difference in performance between blockwise and blocks implementation, as in the blockwise was incredibly slower.

After the analysis of the dask performance reports, I found out the compute time was same and the issue as due to serialisation/deserialisation as I had included the np.empty(..) repeatedly and lambdas in the implementation which increased the serialisation/deserialisation significantly. Now I have fixed those issues and blockwise is as performant as blocks implementation.

@aktech
Copy link
Contributor Author

aktech commented Feb 3, 2021

pre-commit seems to fail for unrelated changes:

sgkit/io/utils.py:186: error: Incompatible types in assignment (expression has type "number[Any]", variable has type "int")  [assignment]
sgkit/window.py:118: error: Item "ndarray" of "Union[ndarray, Any]" has no attribute "chunks"  [union-attr]
sgkit/stats/regenie.py:657: error: Incompatible types in assignment (expression has type "ndarray", variable has type "Optional[Sequence[float]]")  [assignment]
sgkit/stats/regenie.py:699: error: Argument "alphas" to "_stage_1" has incompatible type "Optional[Sequence[float]]"; expected "Optional[ndarray]"  [arg-type]
sgkit/stats/regenie.py:704: error: Argument "alphas" to "_stage_2" has incompatible type "Optional[Sequence[float]]"; expected "Optional[ndarray]"  [arg-type]
sgkit/stats/hwe.py:51: error: unused 'type: ignore' comment
sgkit/stats/hwe.py:94: error: unused 'type: ignore' comment
sgkit/stats/pca.py:133: error: Argument 1 to "__call__" of "ufunc" has incompatible type "DataArray"; expected "Union[Union[int, float, complex, str, bytes, generic], Sequence[Union[int, float, complex, str, bytes, generic]], Sequence[Sequence[Any]], _SupportsArray]"  [arg-type]
sgkit/io/vcfzarr_reader.py:199: error: Item "ndarray" of "Union[ndarray, Any]" has no attribute "values"  [union-attr]
sgkit/tests/test_pca.py:32: error: Returning Any from function declared to return "ndarray"  [no-any-return]
sgkit/io/bgen/bgen_reader.py:148: error: Returning Any from function declared to return "ndarray"  [no-any-return]
sgkit/io/bgen/bgen_reader.py:365: error: Non-overlapping container check (element type: "dtype[Any]", container item type: "Type[unsignedinteger[Any]]")  [comparison-overlap]
sgkit/tests/test_preprocessing.py:88: error: List item 0 has incompatible type "float"; expected "int"  [list-item]
sgkit/tests/test_preprocessing.py:124: error: Module has no attribute "bool"; maybe "bool_" or "bool8"?  [attr-defined]
sgkit/tests/test_preprocessing.py:162: error: Module has no attribute "bool"; maybe "bool_" or "bool8"?  [attr-defined]
sgkit/io/plink/plink_reader.py:87: error: Returning Any from function declared to return "ndarray"  [no-any-return]
sgkit/io/plink/plink_reader.py:268: error: Item "ndarray" of "Union[ndarray, Any]" has no attribute "compute"  [union-attr]
Found 17 errors in 10 files (checked 87 source files)

@tomwhite
Copy link
Collaborator

tomwhite commented Feb 3, 2021

@aktech I filed #451 for these (unrelated) failures

@mergify
Copy link
Contributor

mergify bot commented Feb 3, 2021

This PR has conflicts, @aktech please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Feb 3, 2021
@aktech aktech force-pushed the pairwise-via-blockwise branch from 4c33143 to 92898e0 Compare February 4, 2021 14:21
@mergify mergify bot removed the conflict PR conflict label Feb 4, 2021
@aktech aktech force-pushed the pairwise-via-blockwise branch from 92898e0 to 1945e14 Compare February 4, 2021 16:05
@aktech aktech force-pushed the pairwise-via-blockwise branch from c42e8c1 to dd0c537 Compare February 5, 2021 02:41
@aktech
Copy link
Contributor Author

aktech commented Feb 5, 2021

The tests failures are unrelated, comments have been addressed. @ravwojdyla

@aktech aktech requested a review from ravwojdyla February 5, 2021 03:12
@aktech aktech force-pushed the pairwise-via-blockwise branch from 37af78b to e754773 Compare February 5, 2021 09:46
@codecov-io
Copy link

codecov-io commented Feb 5, 2021

Codecov Report

Merging #447 (e754773) into master (9150392) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master      #447   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           34        34           
  Lines         2361      2384   +23     
=========================================
+ Hits          2361      2384   +23     
Impacted Files Coverage Δ
sgkit/distance/api.py 100.00% <100.00%> (ø)
sgkit/distance/metrics.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9150392...e754773. Read the comment docs.

Copy link
Collaborator

@ravwojdyla ravwojdyla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work @aktech! (please remove the notebooks before the merge)

@aktech
Copy link
Contributor Author

aktech commented Feb 5, 2021

@ravwojdyla Thanks for guiding me throughout, I have removed the notebooks.

@ravwojdyla ravwojdyla merged commit d040e7e into sgkit-dev:master Feb 8, 2021
@aktech aktech deleted the pairwise-via-blockwise branch February 8, 2021 11:23
@aktech aktech restored the pairwise-via-blockwise branch February 9, 2021 12:59
@aktech aktech deleted the pairwise-via-blockwise branch February 15, 2021 17:27
@aktech aktech restored the pairwise-via-blockwise branch February 15, 2021 17:27
@aktech aktech deleted the pairwise-via-blockwise branch March 16, 2021 20:28
@aktech aktech restored the pairwise-via-blockwise branch March 16, 2021 20:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pairwise distance scalability
4 participants