feature: Log data sizes in load test benchmarks #1949

Merged
LeonLuttenberger merged 18 commits into release-3.0.0 from distributed/load-test-benchmarks on Jan 30, 2023

@LeonLuttenberger commented Jan 20, 2023

Feature or Bugfix

  • Feature

Detail

  • Log data sizes in load test benchmarks
  • Create equivalent tests that use Modin and Ray exclusively for S3 IO, in order to compare AWS SDK for pandas performance with Modin/Ray performance
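As a rough illustration of the first bullet, a benchmark timer can record the size of the data it touched alongside the elapsed time. The sketch below is not the implementation in this PR: `SizedTimer`, `size_of`, and the in-memory `fake_sizes` lookup are hypothetical names, and a real load test would presumably look the sizes up in S3 instead.

```python
import logging
import time
from typing import Callable, Iterable, Optional

logger = logging.getLogger("benchmark")


class SizedTimer:
    """Hypothetical context manager: times a block and logs the data size involved."""

    def __init__(self, name: str, data_paths: Iterable[str], size_of: Callable[[str], int]) -> None:
        self.name = name
        self.data_paths = list(data_paths)
        self.size_of = size_of  # assumption: returns the size in bytes for one path
        self.elapsed: Optional[float] = None
        self.total_bytes: Optional[int] = None

    def __enter__(self) -> "SizedTimer":
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.elapsed = time.perf_counter() - self._start
        # Sizes are summed after the timed block so the lookup does not skew the timing.
        self.total_bytes = sum(self.size_of(p) for p in self.data_paths)
        logger.info("%s: %.3f s, %d bytes", self.name, self.elapsed, self.total_bytes)


# Stand-in for an object-store size lookup (made-up paths and sizes).
fake_sizes = {"s3://example-bucket/a.parquet": 1_000, "s3://example-bucket/b.parquet": 2_500}

with SizedTimer("read_two_files", fake_sizes.keys(), fake_sizes.__getitem__) as timer:
    sum(range(10_000))  # placeholder for the benchmarked work
```

Logging time and bytes together makes it possible to normalise results (for example, seconds per gigabyte) when the same test runs against datasets of different sizes.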

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@malachi-constant

This comment was marked as outdated.

@LeonLuttenberger LeonLuttenberger marked this pull request as ready for review January 25, 2023 21:07
Comment on lines 12 to 28
@pytest.fixture(scope="function")
def df_s() -> pd.DataFrame:
    # Data frame with 100000 rows
    ray_ds = ray.data.read_parquet("s3://ursa-labs-taxi-data/2010/02/data.parquet")
    return ray_ds.to_modin()


@pytest.fixture(scope="function")
def big_modin_df() -> pd.DataFrame:
    pandas_refs = ray.data.range_table(100_000).to_pandas_refs()
    dataset = ray.data.from_pandas_refs(pandas_refs)

    frame = dataset.to_modin()
    frame["foo"] = frame.value * 2
    frame["bar"] = frame.value % 2

    return frame

Since these are defined in test_s3 as well, should we just move them into a fixture file?

def test_modin_s3_read_parquet_simple(benchmark_time: float, request: pytest.FixtureRequest) -> None:
    path = "s3://ursa-labs-taxi-data/2018/"
    with ExecutionTimer(request, data_paths=path) as timer:
        ray_ds = ray.data.read_parquet(path)

Is there a reason we stage the data in a Ray dataset first, when Modin already has dedicated read methods?
The test I had in mind was simply pd.read_parquet(path), where pd comes from import modin.pandas as pd.

One of the reasons I replaced the Parquet read with the Ray version is that pandas was having a lot of trouble with the partition style: the partition directories are named 01 rather than month=01.

Once we move the load test data into our bucket, we can make sure that the data structure is such that both Modin and AWS SDK for pandas can access it natively: #1962
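To make the partition-style point concrete, here is a minimal sketch of how key=value (Hive-style) directory names can be parsed out of a path. `hive_partitions` and the `example-bucket` path are hypothetical, and this is not how pandas, Modin, or AWS SDK for pandas actually discover partitions; it only shows that a layout such as 2010/02 carries no column names for a reader to infer.

```python
from typing import Dict


def hive_partitions(path: str) -> Dict[str, str]:
    """Return the key=value directory components found in a path."""
    parts: Dict[str, str] = {}
    for segment in path.split("/")[:-1]:  # skip the final component (the file name)
        key, sep, value = segment.partition("=")
        if sep and key and value:
            parts[key] = value
    return parts


# Layout described in the comment: bare directory names, so no keys are found.
bare = hive_partitions("s3://ursa-labs-taxi-data/2010/02/data.parquet")

# Hypothetical restructured layout with key=value directories.
keyed = hive_partitions("s3://example-bucket/taxi/year=2010/month=02/data.parquet")

print(bare)
print(keyed)
```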

@malachi-constant

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 845bacc
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant

AWS CodeBuild CI Report

  • CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
  • Commit ID: 845bacc
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant

AWS CodeBuild CI Report

  • CodeBuild project: GitHubStandardCodeBuild8C06-llutOAimTATs
  • Commit ID: a9c6c00
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant

AWS CodeBuild CI Report

  • CodeBuild project: GitHubLoadTests5656BB24-ATYtnXPE7MOa
  • Commit ID: 845bacc
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@LeonLuttenberger LeonLuttenberger merged commit b0d7697 into release-3.0.0 Jan 30, 2023
@LeonLuttenberger LeonLuttenberger deleted the distributed/load-test-benchmarks branch January 30, 2023 17:21
3 participants