
Conversation

Contributor

@frcroth frcroth commented May 26, 2025

URL of deployed dev instance (used for testing):

  • https://___.webknossos.xyz (No frontend)

This PR adds routes in the datastore to access datasets by ID instead of by organizationId and directory name. Data sources are cached in the datastore; the configuration is generated by WK from the DB, and WK can also invalidate the datastore cache when a dataset has changed. This PR only supports remote datasets; in particular, WKW datasets are not supported.

This adds new routes under the "wkDataset" path, since the existing dataset routes are followed directly by the organizationId, which would make routes that take a datasetId in that position ambiguous. The Zarr routes are also duplicated to use dataset IDs. In a follow-up, once WKW datasets are also supported, the old routes should be removed and the new ones can be renamed to nicer-sounding routes.
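For illustration, a by-ID datastore action backed by the new cache could look roughly like this. This is only a sketch: it reuses names added in this PR (DatasetCache.getById, UserAccessRequest.readDataset, WKDatasetController), but the exact signatures may differ, and the "dataset.invalidId"/"dataset.notFound" message keys are made up here.

// inside WKDatasetController; dependency injection and imports omitted
def readDataSource(datasetId: String): Action[AnyContent] =
  Action.async { implicit request =>
    accessTokenService.validateAccessFromTokenContext(UserAccessRequest.readDataset(datasetId)) {
      for {
        datasetIdValidated <- ObjectId.fromString(datasetId) ?~> "dataset.invalidId"
        dataSource         <- datasetCache.getById(datasetIdValidated) ?~> "dataset.notFound"
      } yield Ok(Json.toJson(dataSource))
    }
  }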

Steps to test:

  • In the two changed frontend files (rest_api.ts and wkstore_adapter.ts), change the datasetId to the ID of a remote dataset in your WK instance.
  • Open a dataset (preferably the same dataset, so that the metadata matches) and check that data and the histogram load.
  • Explore a remote dataset as a Zarr dataset (e.g. by exploring http://localhost:9000/data/wkZarr/683d9643f600008d0e26a0a3)

TODOs:

  • Cache invalidation
  • Remove todos
  • Revert frontend changes

Issues:


(Please delete unneeded items, merge only when none are left open)

  • Updated changelog (no user-facing changes for now)
  • Adapted wk-libs python client if relevant API parts change (the API is only appended to right now)
  • Removed dev-only changes like prints and application.conf edits
  • Considered common edge cases
  • Needs datastore update after deployment

Contributor

coderabbitai bot commented May 26, 2025

📝 Walkthrough

This update introduces enhanced dataset access control, new dataset cache and attachment management, and expanded API endpoints for dataset data access and cache invalidation. It adds support for dataset layer attachments, detailed data source representations, and integrates these features into both the main application and the datastore service, with corresponding updates to public APIs and access token handling.

Changes

Files/Paths Change Summary
app/controllers/UserTokenController.scala, webknossos-datastore/app/com/scalableminds/.../AccessTokenService.scala Added dataset-specific access control methods and resource type; updated access validation to support datasets.
app/controllers/WKRemoteDataStoreController.scala, conf/webknossos.latest.routes Added API endpoint and controller action to fetch a datasource by dataset ID with access validation.
app/models/dataset/Dataset.scala Added support for dataset mag parsing, dataset layer attachments, and extended data layer metadata; introduced new DAO for layer attachments.
app/models/dataset/DatasetService.scala Introduced method for detailed data source representation with mags; added datastore update notification after data source changes.
app/models/dataset/WKRemoteDataStoreClient.scala, webknossos-datastore/app/com/scalableminds/.../DSRemoteWebknossosClient.scala Added methods for updating dataset cache and fetching datasets from remote webknossos.
app/models/annotation/AnnotationService.scala, webknossos-datastore/app/com/scalableminds/.../AnnotationSource.scala Added datasetId field to AnnotationSource and serialization logic.
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala Introduced DatasetCache class for caching and retrieving datasets and layers, with manual invalidation.
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKDatasetController.scala Added new controller for dataset data access, supporting voxel data, thumbnails, histograms, and more, with access validation.
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala Added cache invalidation endpoint and injected DatasetCache; minor code cleanup.
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DatasetLayerAttachments.scala Added isEmpty method to DatasetLayerAttachments.
webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes Added new REST API routes for dataset data access and cache invalidation.
webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreModule.scala Bound DatasetCache as an eager singleton in DI module.
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/BinaryDataController.scala Removed trailing comma in constructor parameter list (formatting only).

Suggested labels

backend, new feature

Suggested reviewers

  • fm3

Poem

🐇 In data fields where pixels play,
New caches bloom to save the day.
Layers attach with mags in line,
Access guarded, routes align.
A hop through code, a joyful cheer,
Dataset magic, crystal clear!
🥕✨ Hoppity-hop, the future’s here!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 16dfb39 and 89f0eb8.

📒 Files selected for processing (1)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build-smoketest-push
  • GitHub Check: frontend-tests
  • GitHub Check: backend-tests


@frcroth frcroth requested a review from fm3 June 4, 2025 12:34
@frcroth frcroth marked this pull request as ready for review June 4, 2025 12:34
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (5)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)

778-784: Consider adding error handling for cache update operations.

The method correctly validates write access before updating the cache, but the cache update operation could potentially fail. Consider wrapping the cache update in a Fox for proper error handling.

 def updateCache(datasetId: String): Action[DataSource] =
   Action.async(validateJson[DataSource]) { implicit request =>
     accessTokenService.validateAccessFromTokenContext(UserAccessRequest.writeDataset(datasetId)) {
-      datasetCache.updateById(datasetId, request.body)
-      Future.successful(Ok)
+      for {
+        _ <- datasetCache.updateById(datasetId, request.body)
+      } yield Ok
     }
   }
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (1)

17-17: Consider adding cache configuration

The TemporaryStore cache has no visible configuration for TTL, size limits, or eviction policies. This could lead to unbounded memory usage in production.

Consider adding configuration parameters for:

  • Maximum cache size
  • Time-to-live (TTL) for cached entries
  • Eviction policy when cache is full
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKDatasetController.scala (1)

75-101: Consider using a request object to reduce parameter count.

The method has 10 parameters which makes it error-prone to call. Consider creating a dedicated request case class to encapsulate these parameters, similar to how RawCuboidRequest is used in the POST variant.

Example structure:

case class RawCuboidGetRequest(
  x: Int, y: Int, z: Int,
  width: Int, height: Int, depth: Int,
  mag: String,
  halfByte: Boolean = false,
  mappingName: Option[String] = None
)
app/models/dataset/Dataset.scala (2)

851-866: Consider adding error context for axis order parsing failures.

While the current implementation safely handles parsing failures by returning None, it silently discards parsing errors. Consider logging warnings when axis order JSON parsing fails to help with debugging data issues.

-      axisOrderParsed = row.axisorder match {
-        case Some(axisOrder) => JsonHelper.parseAs[AxisOrder](axisOrder).toOption
-        case None            => None
-      }
+      axisOrderParsed = row.axisorder match {
+        case Some(axisOrder) => JsonHelper.parseAs[AxisOrder](axisOrder) match {
+          case Full(parsed) => Some(parsed)
+          case failure =>
+            logger.warn(s"Failed to parse axis order for dataset ${row._Dataset}, layer ${row.datalayername}: $failure")
+            None
+        }
+        case None => None
+      }

1066-1082: Consider using parallel processing for attachment parsing.

The current implementation uses Fox.serialCombined which processes attachments sequentially. Since these are independent operations, consider using Fox.combined for better performance when dealing with many attachments.

   def parseAttachments(rows: List[DatasetLayerAttachmentsRow]): Fox[AttachmentWrapper] =
     for {
-      meshFiles <- Fox.serialCombined(rows.filter(_.`type` == LayerAttachmentType.mesh.toString))(parseRow)
-      agglomerateFiles <- Fox.serialCombined(rows.filter(_.`type` == LayerAttachmentType.agglomerate.toString))(
-        parseRow)
-      connectomeFiles <- Fox.serialCombined(rows.filter(_.`type` == LayerAttachmentType.connectome.toString))(parseRow)
-      segmentIndexFiles <- Fox.serialCombined(rows.filter(_.`type` == LayerAttachmentType.segmentIndex.toString))(
-        parseRow)
-      cumsumFiles <- Fox.serialCombined(rows.filter(_.`type` == LayerAttachmentType.cumsum.toString))(parseRow)
+      meshFiles <- Fox.combined(rows.filter(_.`type` == LayerAttachmentType.mesh.toString).map(parseRow))
+      agglomerateFiles <- Fox.combined(rows.filter(_.`type` == LayerAttachmentType.agglomerate.toString).map(parseRow))
+      connectomeFiles <- Fox.combined(rows.filter(_.`type` == LayerAttachmentType.connectome.toString).map(parseRow))
+      segmentIndexFiles <- Fox.combined(rows.filter(_.`type` == LayerAttachmentType.segmentIndex.toString).map(parseRow))
+      cumsumFiles <- Fox.combined(rows.filter(_.`type` == LayerAttachmentType.cumsum.toString).map(parseRow))
     } yield
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a410e8f and e64fd7b.

📒 Files selected for processing (17)
  • app/controllers/UserTokenController.scala (2 hunks)
  • app/controllers/WKRemoteDataStoreController.scala (1 hunks)
  • app/models/dataset/Dataset.scala (10 hunks)
  • app/models/dataset/DatasetService.scala (4 hunks)
  • app/models/dataset/WKRemoteDataStoreClient.scala (2 hunks)
  • conf/webknossos.latest.routes (1 hunks)
  • frontend/javascripts/admin/rest_api.ts (1 hunks)
  • frontend/javascripts/viewer/model/bucket_data_handling/wkstore_adapter.ts (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreModule.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/BinaryDataController.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (4 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKDatasetController.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DatasetLayerAttachments.scala (3 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/AccessTokenService.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (1 hunks)
  • webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DatasetLayerAttachments.scala (1)
Learnt from: frcroth
PR: scalableminds/webknossos#8598
File: webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DatasetLayerAttachments.scala:89-95
Timestamp: 2025-06-02T09:49:51.030Z
Learning: In WebKnossos dataset layer attachments, multiple file types can safely use the same directory name (like "agglomerates") because the scanning logic filters by file extension. For example, AgglomerateFileInfo scans for .hdf5 files while CumsumFileInfo scans for .json files in the same "agglomerates" directory without interference.
🧬 Code Graph Analysis (2)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/AccessTokenService.scala (3)
app/models/job/Job.scala (1)
  • datasetId (55-55)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DataSource.scala (2)
  • DataSourceId (10-12)
  • DataSourceId (14-31)
app/models/dataset/Dataset.scala (1)
  • dataSourceId (92-92)
app/models/dataset/Dataset.scala (5)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/datareaders/AxisOrder.scala (3)
  • AxisOrder (8-23)
  • AxisOrder (25-46)
  • toString (53-53)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DataLayer.scala (1)
  • DataFormat (30-32)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DatasetLayerAttachments.scala (7)
  • DatasetLayerAttachments (12-21)
  • DatasetLayerAttachments (23-25)
  • LayerAttachment (37-37)
  • LayerAttachment (39-60)
  • LayerAttachmentDataFormat (27-30)
  • LayerAttachmentType (32-35)
  • isEmpty (19-21)
util/src/main/scala/com/scalableminds/util/tools/JsonHelper.scala (1)
  • JsonHelper (17-112)
util/src/main/scala/com/scalableminds/util/tools/Fox.scala (1)
  • combined (84-96)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: backend-tests
🔇 Additional comments (25)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/BinaryDataController.scala (1)

41-41: Approve removal of trailing comma.

This is a purely stylistic cleanup in the constructor parameter list and does not affect any logic or behavior.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreModule.scala (1)

37-37: LGTM! Proper dependency injection setup.

The DatasetCache binding follows the established pattern and correctly registers as an eager singleton, which is appropriate for a cache service.

conf/webknossos.latest.routes (1)

113-113: LGTM! Well-structured route definition.

The new route follows REST conventions and integrates properly with existing datastore routes. Parameter types correctly match the controller method signature.

app/controllers/WKRemoteDataStoreController.scala (1)

262-271: Verify access control approach for dataset retrieval.

The method uses GlobalAccessContext to retrieve datasets, which bypasses user-specific access controls. Confirm this is intentional for datastore-to-datastore communication.

If this endpoint should respect dataset permissions, consider using an appropriate access context:

 def getDataset(name: String, key: String, datasetId: ObjectId): Action[AnyContent] =
   Action.async { implicit request =>
     dataStoreService.validateAccess(name, key) { _ =>
       for {
-        dataset <- datasetDAO.findOne(datasetId)(GlobalAccessContext)
+        // Consider if dataset-level access control is needed here
+        dataset <- datasetDAO.findOne(datasetId)(GlobalAccessContext)
         dataSource <- datasetService.fullDataSourceFor(dataset)
       } yield Ok(Json.toJson(dataSource))
     }
   }

Verify with the team whether datastore authentication is sufficient or if additional dataset-level authorization is required.

app/models/dataset/WKRemoteDataStoreClient.scala (2)

20-20: LGTM!

The import addition is necessary for the new updateDatasetInDSCache method and follows proper import organization.


119-125: LGTM!

The new cache update method is well-implemented:

  • Follows the established pattern used by other methods in the class
  • Uses appropriate authentication with webknossosToken
  • The URL path aligns with the new dataset ID-based routing mentioned in the PR objectives
  • Proper return type Fox[Unit] consistent with other cache operations
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (2)

15-15: LGTM!

The expanded import correctly adds the necessary types for the new getDataset method while maintaining clean import organization.


196-199: LGTM!

The new dataset retrieval method is well-implemented:

  • Follows established patterns used throughout the class
  • Uses proper authentication with dataStoreKey
  • The endpoint URL aligns with the new dataset ID-based API design
  • Appropriate return type Fox[GenericDataSource[DataLayer]] for dataset operations
  • Includes meaningful error message for debugging
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)

43-43: LGTM: Clean dependency injection addition.

The DatasetCache dependency is properly injected and will be used by the new updateCache method.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DatasetLayerAttachments.scala (2)

19-20: LGTM: Useful utility method with correct implementation.

The isEmpty method properly checks all attachment collections and options, providing a clean way to determine if attachments are present.


27-27: LGTM: Consistent naming improvement.

The renaming from LayerAttachmentDataformat to LayerAttachmentDataFormat improves consistency with proper capitalization conventions. All references have been updated correctly throughout the file.

Also applies to: 37-37, 44-44, 64-64, 72-72, 80-80, 88-88, 96-96

app/controllers/UserTokenController.scala (2)

99-101: LGTM: Proper delegation to dataset access handler.

The pattern match correctly delegates dataset access requests to the new handleDataSetAccess method, using the directoryName as the dataset ID parameter.


164-190: LGTM: Well-structured dataset access control implementation.

The method follows the established pattern for access control:

  • Properly converts string ID to ObjectId
  • Implements separate read and write access logic
  • Uses appropriate database contexts for access validation
  • Handles only supported access modes (read/write)

The implementation is consistent with the existing handleDataSourceAccess method.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/AccessTokenService.scala (2)

23-23: LGTM: Logical extension of access resource types.

Adding dataset to the enumeration properly extends the access control framework to support dataset-level access validation.


42-43: LGTM: Consistent factory method implementation.

The readDataset and writeDataset factory methods follow the established pattern:

  • Use the datasetId as the directoryName in DataSourceId
  • Empty organizationId (consistent with other ID-based access requests)
  • Proper access modes for the respective operations

This maintains consistency with existing access request patterns.

Also applies to: 48-49

webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes (1)

15-20: Well-structured dataset-centric routes

The new /wkDatasets/ routes provide a clean API for accessing datasets by ID. The distinct prefix prevents ambiguity with existing organization-based routes, and the consistent structure with existing endpoints makes the API intuitive.

Also applies to: 127-127

webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKDatasetController.scala (6)

53-73: LGTM! Well-structured endpoint with proper access control and monitoring.

The method correctly validates access, handles errors gracefully, and includes performance monitoring for slow requests.


103-112: Clean implementation using request object.

This POST variant properly uses a request object pattern, which is cleaner than the GET variant with many parameters.


114-164: Robust thumbnail generation with good error handling.

The implementation includes proper fallback handling for empty data (lines 155-157) and comprehensive parameter validation. The JPEG generation pipeline is well-structured.


166-175: LGTM!

Simple and effective implementation for finding data positions.


177-188: Well-implemented with informative error messages.

The method provides clear error messages for both missing data sources and histogram generation failures.


190-198: LGTM!

Clean helper method that properly transforms requests and delegates to the binary data service.

app/models/dataset/Dataset.scala (3)

868-873: LGTM! Clear implementation with helpful context.

The method correctly retrieves all magnifications and the comment about WKW datasets provides important context for maintainers.


886-956: Well-integrated attachment support for dataset layers.

The implementation correctly extends layer parsing to include attachments, data format, and channel count. Error messages are descriptive and will help with debugging.


1092-1118: LGTM! Clean implementation of attachment updates.

The method properly handles clearing old attachments before inserting new ones within a transaction, ensuring data consistency.

Comment on lines 306 to 483
private def findLayersForDatasetWithMags(datasetId: ObjectId): Fox[List[DataLayer]] =
for {
layers <- datasetDataLayerDAO.findAllForDataset(datasetId)
_ <- Fox.fromBool(!layers.flatMap(_.dataFormatOpt).contains(DataFormat.wkw)) ?~> "WKW data format not supported in this context, only datasets with MagLocators are supported"
layerNamesAndMags <- datasetMagsDAO.findAllByDatasetId(datasetId)
layersWithMags = layers.map { layer =>
val mags = layerNamesAndMags.filter(_._1 == layer.name).map(_._2).toList
layer match {
case AbstractDataLayer(name,
category,
boundingBox,
_,
elementClass,
defaultViewConfiguration,
adminViewConfiguration,
coordinateTransformations,
additionalAxes,
attachmentsOpt,
_,
numChannels,
dataFormat,
_) =>
dataFormat match {
case Some(df) =>
df match {
case DataFormat.wkw =>
throw new NotImplementedError(
"WKW data format not supported in this context, only datasets with MagLocators are supported")
case DataFormat.neuroglancerPrecomputed =>
PrecomputedDataLayer(
name,
boundingBox,
category,
elementClass,
mags,
defaultViewConfiguration,
adminViewConfiguration,
coordinateTransformations,
numChannels,
additionalAxes,
attachmentsOpt
)
case DataFormat.n5 =>
N5DataLayer(
name,
category,
boundingBox,
elementClass,
mags,
defaultViewConfiguration,
adminViewConfiguration,
coordinateTransformations,
numChannels,
additionalAxes,
attachmentsOpt
)
case DataFormat.zarr =>
ZarrDataLayer(
name,
category,
boundingBox,
elementClass,
mags,
defaultViewConfiguration,
adminViewConfiguration,
coordinateTransformations,
numChannels,
additionalAxes,
attachmentsOpt,
df
)
case DataFormat.zarr3 =>
Zarr3DataLayer(
name,
category,
boundingBox,
elementClass,
mags,
defaultViewConfiguration,
adminViewConfiguration,
coordinateTransformations,
numChannels,
additionalAxes,
attachmentsOpt,
)
}
case None => ???
}
case AbstractSegmentationLayer(name,
_,
boundingBox,
_,
elementClass,
largestSegmentId,
mappings,
defaultViewConfiguration,
adminViewConfiguration,
coordinateTransformations,
additionalAxes,
attachmentsOpt,
_,
numChannels,
dataFormat,
_) =>
dataFormat match {
case Some(df) =>
df match {
case DataFormat.wkw =>
throw new NotImplementedError(
"WKW data format not supported in this context, only datasets with MagLocators are supported")
case DataFormat.neuroglancerPrecomputed =>
PrecomputedSegmentationLayer(
name,
boundingBox,
elementClass,
mags,
largestSegmentId,
mappings,
defaultViewConfiguration,
adminViewConfiguration,
coordinateTransformations,
numChannels,
additionalAxes,
attachmentsOpt
)
case DataFormat.n5 =>
N5SegmentationLayer(
name,
boundingBox,
elementClass,
mags,
largestSegmentId,
mappings,
defaultViewConfiguration,
adminViewConfiguration,
coordinateTransformations,
numChannels,
additionalAxes,
attachmentsOpt
)
case DataFormat.zarr =>
ZarrSegmentationLayer(
name,
boundingBox,
elementClass,
mags,
largestSegmentId,
mappings,
defaultViewConfiguration,
adminViewConfiguration,
coordinateTransformations,
numChannels,
additionalAxes,
attachmentsOpt,
df
)
case DataFormat.zarr3 =>
Zarr3SegmentationLayer(
name,
boundingBox,
elementClass,
mags,
largestSegmentId,
mappings,
defaultViewConfiguration,
adminViewConfiguration,
coordinateTransformations,
numChannels,
additionalAxes,
attachmentsOpt
)
}
case None => ???
}
case _ => throw new NotImplementedError("DataLayer type mismatch (unreachable)")
}
}
} yield layersWithMags
Contributor

🛠️ Refactor suggestion

Refactor duplicated pattern matching logic

This method contains extensive code duplication between the AbstractDataLayer and AbstractSegmentationLayer cases. Each data format is handled almost identically for both layer types.

Consider extracting the common logic into helper methods:

private def createDataLayerForFormat(
  dataFormat: DataFormat,
  name: String,
  boundingBox: BoundingBox,
  elementClass: ElementClass,
  mags: List[MagInfo],
  // ... other common parameters
): DataLayer = dataFormat match {
  case DataFormat.neuroglancerPrecomputed => 
    PrecomputedDataLayer(name, boundingBox, ...)
  case DataFormat.n5 => 
    N5DataLayer(name, ...)
  // ... other formats
}

private def createSegmentationLayerForFormat(
  dataFormat: DataFormat,
  // ... parameters
): DataLayer = dataFormat match {
  // Similar pattern for segmentation layers
}
🤖 Prompt for AI Agents
In app/models/dataset/DatasetService.scala between lines 306 and 483, the
pattern matching logic for handling data formats in AbstractDataLayer and
AbstractSegmentationLayer cases is duplicated. To fix this, extract the common
logic for creating data layers and segmentation layers into separate helper
methods that take the data format and relevant parameters, then return the
appropriate DataLayer instance. Replace the duplicated match blocks with calls
to these helper methods to improve code maintainability and reduce redundancy.

private def findLayersForDatasetWithMags(datasetId: ObjectId): Fox[List[DataLayer]] =
for {
layers <- datasetDataLayerDAO.findAllForDataset(datasetId)
_ <- Fox.fromBool(!layers.flatMap(_.dataFormatOpt).contains(DataFormat.wkw)) ?~> "WKW data format not supported in this context, only datasets with MagLocators are supported"
Contributor

⚠️ Potential issue

Inconsistent error handling for WKW format

The method uses Fox.fromBool to check for WKW format at line 309, but then throws NotImplementedError in the pattern matching. This creates two different error paths for the same condition.

Either rely on the initial check:

-case DataFormat.wkw =>
-  throw new NotImplementedError(
-    "WKW data format not supported in this context, only datasets with MagLocators are supported")
+case DataFormat.wkw =>
+  // This case should be unreachable due to the check at line 309
+  Fox.failure("Internal error: WKW format should have been filtered out")

Or remove the initial check and handle it consistently in the pattern matching.

Also applies to: 331-333, 413-415

🤖 Prompt for AI Agents
In app/models/dataset/DatasetService.scala at line 309 and similarly at lines
331-333 and 413-415, the error handling for the WKW data format is inconsistent
because it uses Fox.fromBool for an initial check but then throws
NotImplementedError later. To fix this, choose one consistent approach: either
keep the Fox.fromBool check and remove the NotImplementedError throws in the
pattern matching, or remove the Fox.fromBool check and handle the error
uniformly with NotImplementedError in the pattern matching. Apply the chosen
approach consistently across all mentioned lines.

Member

@fm3 fm3 left a comment

Thanks for delving into this, good stuff!

I left a couple of comments, mostly on caching.

Testing did not go all that well for me yet, maybe I’m doing some things wrong. I switched the dataset id in the ts files for one that exists in my db and tried to open a dataset, but I only get black.

In the backend logging I see a lot of failed RPCs with 403 "DataStore not found." as well as some failed PUT requests with {"obj.scale":[{"msg":["error.path.missing"],"args":[]}],"obj.dataLayers":[{"msg":["error.path.missing"],"args":[]}]}

If everything works for you, maybe we could debug on my machine together on monday.

I agree that we should attempt to minimize code duplication between the implementation of the datastore routes by id and those by organizationId+datasetDirectoryName. I hope we can get rid of the old access pattern at some point, but that may take several months.

case Some(df) =>
  df match {
    case DataFormat.wkw =>
      throw new NotImplementedError(
Member

please don’t mix exceptions with box/fox error handling. If it is too complex to return Box in the match cascades, please at least wrap it with tryo so the exceptions are ultimately turned into boxes so they can bubble up properly.
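For example, a minimal sketch of that tryo wrapping (import paths are assumptions here; the project may re-export Box/tryo from its own util package rather than from Lift directly, and buildZarrLayer is a hypothetical helper):

import net.liftweb.common.Box
import net.liftweb.util.Helpers.tryo

def layerForFormat(dataFormat: DataFormat): Box[DataLayer] =
  tryo {
    dataFormat match {
      case DataFormat.zarr => buildZarrLayer() // hypothetical happy path
      case DataFormat.wkw =>
        throw new NotImplementedError("WKW not supported here") // captured as a Failure box
      case other =>
        throw new IllegalArgumentException(s"Unexpected data format: $other")
    }
  } // exceptions end up as Failure boxes and can bubble up through the Fox chain instead of crashing the request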

object AccessResourceType extends ExtendedEnumeration {
  type AccessResourceType = Value
  val datasource, tracing, annotation, webknossos, jobExport = Value
  val datasource, dataset, tracing, annotation, webknossos, jobExport = Value
Member

hmm again naming things is hard 😬 I don’t really have a clean solution either. I hope we can just get rid of the traditional/datasource codepaths soon.

@Named("webknossos-datastore") val actorSystem: ActorSystem)(implicit ec: ExecutionContext)
extends FoxImplicits {

val cache = new TemporaryStore[ObjectId, DataSource](actorSystem)
Member

Implementing this with a TemporaryStore and explicit get and insert solves this. However, if many requests pour in that trigger this codepath, they will all be cache misses at first, so a lot of requests to the remote webknossos will be caused.

Instead, how about using the AlfuCache pattern, which is actually asynchronous and caches the Futures?

This would, however, require a few changes. The AlfuCache Pattern is getOrLoad, so there is no explicit insert.

But I think this could work. Instead of explicit inserts, you could just clear the existing entries to invalidate them. Then the next getOrLoad will re-fill the cache with the updated version. What do you think?
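A rough sketch of that getOrLoad idea (AlfuCache is webknossos-internal, so the method names getOrLoad/remove and the exact signatures are assumptions here; project-internal imports are omitted):

import scala.concurrent.duration._

class DatasetCache @Inject()(remoteWebknossosClient: DSRemoteWebknossosClient)(implicit ec: ExecutionContext) {

  // long TTL, since entries are invalidated explicitly
  private lazy val cache: AlfuCache[ObjectId, DataSource] = AlfuCache(timeToLive = 1 day)

  // Concurrent callers share the same pending Future, so a burst of cache misses
  // triggers at most one request to the remote webknossos per dataset id.
  def getById(datasetId: ObjectId): Fox[DataSource] =
    cache.getOrLoad(datasetId, id => remoteWebknossosClient.getDataset(id))

  // No explicit insert: invalidation just drops the entry, and the next getOrLoad re-fills it.
  def invalidateCache(datasetId: ObjectId): Unit =
    cache.remove(datasetId)
}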

@fm3
Member

fm3 commented Jun 5, 2025

I talked to norman again and yes, we want to eventually only use the byId routes in the datastore. So the next steps (possibly in a follow-up PR) would be to duplicate all relevant datastore routes, and implement this for all dataset types, as well as switch the frontend to use only the new routes.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
app/models/dataset/Dataset.scala (1)

890-905: 🛠️ Refactor suggestion

N+1 query pattern when fetching attachments

Inside parseRow we call datasetLayerAttachmentsDAO.findAll… for every layer.
For datasets with many layers this spawns one extra DB round-trip per layer.

Consider bulk-loading all attachments once and grouping them in memory, or extending the original layer query with a LEFT JOIN to fetch attachments in a single shot.
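An illustrative sketch of the bulk-load variant (the attachments DAO method, the layerName row field, and withAttachments are assumptions, not actual PR code):

for {
  layers         <- datasetDataLayerDAO.findAllForDataset(datasetId)
  attachmentRows <- datasetLayerAttachmentsDAO.findAllForDataset(datasetId) // one query instead of one per layer
  rowsByLayer     = attachmentRows.groupBy(_.layerName)
  layersWithAttachments <- Fox.serialCombined(layers) { layer =>
    parseAttachments(rowsByLayer.getOrElse(layer.name, Nil)).map(attachments => layer.withAttachments(attachments))
  }
} yield layersWithAttachments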

♻️ Duplicate comments (2)
app/models/dataset/DatasetService.scala (2)

309-336: 🛠️ Refactor suggestion

Inconsistent WKW-format handling persists

Past review already flagged that we both (a) pre-filter WKW via Fox.fromBool and (b) still throw a NotImplementedError in the pattern-match.
Having two independent guards for the same condition is brittle and violates the “single source of truth” principle.

Either drop the Fox.fromBool guard or replace the NotImplementedError with a Fox.failure, but use exactly one style consistently.


315-392: 🛠️ Refactor suggestion

Large duplication between colour & segmentation logic

The two pattern-matches are ~100 lines each and differ only in the concrete layer constructor. This was pointed out earlier and is still unresolved.

Please extract a helper that maps (df, layerType) → DataLayer to eliminate duplication and ease future maintenance.

Also applies to: 448-478

🧹 Nitpick comments (3)
app/models/dataset/DatasetService.scala (1)

488-493: Network failure from datastore notification is unchecked

updateDatasetInDSCache performs an HTTP request but its failure is ignored.
Consider wrapping the call in Fox.tryo and bubbling up / logging the error so that silent cache desynchronisation does not occur.

app/models/dataset/Dataset.scala (2)

904-905: Minor: redundant lambda

row.dataformat.flatMap(df => DataFormat.fromString(df))

can be shortened to

row.dataformat.flatMap(DataFormat.fromString)

1066-1083: parseAttachments repeatedly filters the same list

Each attachment type filters rows anew, producing five passes over the same collection.

A quick micro-refactor:

- meshFiles         <- Fox.serialCombined(rows.filter(_.`type` == ...))(parseRow)
+ val grouped = rows.groupBy(_.`type`)
+ meshFiles         <- Fox.serialCombined(grouped.getOrElse(LayerAttachmentType.mesh.toString, Nil))(parseRow)

reduces complexity with no behaviour change.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e64fd7b and b8ebc50.

📒 Files selected for processing (10)
  • app/controllers/WKRemoteDataStoreController.scala (1 hunks)
  • app/models/dataset/Dataset.scala (10 hunks)
  • app/models/dataset/DatasetService.scala (5 hunks)
  • app/models/dataset/WKRemoteDataStoreClient.scala (2 hunks)
  • conf/webknossos.latest.routes (1 hunks)
  • frontend/javascripts/admin/rest_api.ts (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (4 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (2 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (1 hunks)
  • webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • conf/webknossos.latest.routes
  • app/controllers/WKRemoteDataStoreController.scala
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala
  • app/models/dataset/WKRemoteDataStoreClient.scala
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala
  • frontend/javascripts/admin/rest_api.ts
  • webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-smoketest-push
  • GitHub Check: backend-tests

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
app/models/dataset/DatasetService.scala (2)

307-488: Large format-switch duplication persists

The lengthy match cascades for data/segmentation layers replicate the logic
earlier flagged as duplicate. Extracting factory helpers (as already proposed
in a previous review) would remove this redundancy and simplify future
maintenance.


310-336: Redundant WKW checks

A top-level Fox.fromBool already rejects DataFormat.wkw; keeping an inner
case DataFormat.wkw => throw … branch is superfluous and adds noise.

Also applies to: 415-418

🧹 Nitpick comments (2)
app/models/dataset/DatasetService.scala (2)

293-306: fullDataSourceFor largely copies dataSourceFor – DRY it up

Both methods are identical except for the layer-retrieval call. Extract the
common part into a small helper:

private def buildSource(dataset: Dataset,
                        layersF: => Fox[List[DataLayer]]): Fox[InboxDataSource] = {
  for {
    layers <- layersF
    id      = DataSourceId(dataset.directoryName, dataset._organization)
    scale  <- dataset.voxelSize.toFox ?~> "dataset.source.usableButNoScale"
  } yield GenericDataSource[DataLayer](id, layers, scale)
}

def dataSourceFor(dataset: Dataset)  = buildSource(dataset,
  datasetDataLayerDAO.findAllForDataset(dataset._id))

def fullDataSourceFor(dataset: Dataset) = buildSource(dataset,
  findLayersForDatasetWithMags(dataset._id))

Keeps behaviour identical while cutting duplication.


394-395: Return a Fox.failure instead of throwing

Throwing IllegalStateException here is caught by tryo, but causes verbose
stack-traces and loses the chance to provide an i18n key. Prefer:

case None =>
  Fox.failure("dataset.layer.missingDataFormat")

for clearer error semantics and cleaner logs.

Also applies to: 480-482

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8ebc50 and bdc75cc.

📒 Files selected for processing (2)
  • app/models/dataset/DatasetService.scala (5 hunks)
  • app/models/dataset/WKRemoteDataStoreClient.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/models/dataset/WKRemoteDataStoreClient.scala
🧰 Additional context used
🧠 Learnings (1)
app/models/dataset/DatasetService.scala (1)
Learnt from: frcroth
PR: scalableminds/webknossos#8657
File: app/models/dataset/DatasetService.scala:380-392
Timestamp: 2025-06-16T09:04:40.565Z
Learning: Trailing commas are supported in Scala starting from version 2.12.2 as part of SIP-27. They are valid in function calls, function definitions, import statements, and other contexts. This means trailing commas after the last parameter or argument do not cause compilation errors in Scala 2.12.2+.
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-smoketest-push
  • GitHub Check: backend-tests

Comment on lines +216 to 217
_ <- notifyDatastoreOnUpdate(foundDataset._id)
} yield foundDataset._id
Contributor

💡 Verification agent

🧩 Analysis chain

Remote cache invalidation can make the entire update fail

notifyDatastoreOnUpdate is executed inside the transactional flow of
updateDataSource*. A transient network error when calling the remote
datastore would roll the whole DB update back, although only the cache could
not be refreshed.

Consider decoupling this call (e.g. run it in a Future/fork or dispatch a
message that can be retried) so the primary write succeeds even if the
datastore is unreachable, and log failures for later inspection.

Also applies to: 238-239


I’ll examine how notifyDatastoreOnUpdate is invoked relative to the transaction boundaries. Retrieving the updateDataSource* definitions, their transaction usage, and the call sites:


🏁 Script executed:

#!/bin/bash
set -e

# Locate updateDataSource methods
echo "=== updateDataSource definitions ==="
rg -n "def updateDataSource" app/models/dataset/DatasetService.scala

# Show surrounding lines of these methods
echo -e "\n=== Context around updateDataSource ==="
rg -n -C5 "updateDataSource" app/models/dataset/DatasetService.scala

# Locate notifyDatastoreOnUpdate calls
echo -e "\n=== notifyDatastoreOnUpdate usages ==="
rg -n -C3 "notifyDatastoreOnUpdate" app/models/dataset/DatasetService.scala

# Check for transaction boundaries (e.g., .transactionally or similar)
echo -e "\n=== Transaction markers ==="
rg -n "transactionally" app/models/dataset/DatasetService.scala || echo "No explicit .transactionally found"

Length of output: 6057


Decouple notifyDatastoreOnUpdate from the transactional flow

The call to notifyDatastoreOnUpdate(foundDataset._id) is currently inside the same for-comprehension (and thus the same transactional/monadic flow) as your database updates. Any transient failure contacting the remote datastore will cause the entire DB update to roll back. To ensure the primary write always succeeds:

• Move the cache-invalidation call outside the DB transaction or wrap it in a detached Future/background job.
• Log and retry failures independently of the DB update.

Locations to update:

  • app/models/dataset/DatasetService.scala: lines 210–217
  • app/models/dataset/DatasetService.scala: lines 232–239
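A rough sketch of the detached-notification idea (assuming Fox exposes the underlying Future[Box] as futureBox; the surrounding DAO call is abbreviated and its name is illustrative):

for {
  _ <- datasetDAO.updateDataSource(...) // primary write, stays inside the transaction
} yield {
  // detached: a failed notification is only logged and cannot roll back the update above
  notifyDatastoreOnUpdate(foundDataset._id).futureBox.foreach {
    case Full(_) => ()
    case failure => logger.warn(s"Datastore cache invalidation failed for ${foundDataset._id}: $failure")
  }
  foundDataset._id
}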
🤖 Prompt for AI Agents
In app/models/dataset/DatasetService.scala around lines 210 to 217 and 232 to
239, the call to notifyDatastoreOnUpdate is currently inside the transactional
flow, causing the entire database update to roll back if the remote cache
invalidation fails. To fix this, move the notifyDatastoreOnUpdate call outside
the transaction or run it asynchronously in a detached Future or background job,
ensuring the primary database update always succeeds. Additionally, add logging
and retry mechanisms for failures in cache invalidation to handle transient
errors without affecting the main transaction.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🔭 Outside diff range comments (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/annotation/AnnotationSource.scala (1)

6-13: New mandatory datasetId will break deserialization of existing documents

datasetId is introduced as a required ObjectId.
All persisted AnnotationSource JSON documents that were written before this PR (or any client payloads not yet updated) lack this field, so:

  1. Json.parse(..).as[AnnotationSource] will now throw JsResultException.
  2. Play will return 400 for incoming requests that omit the property.

Unless a full data-migration + coordinated client rollout is guaranteed, make the field optional or supply a custom Reads with a sensible fallback.

-                            datasetId: ObjectId,
+                            datasetId: Option[ObjectId],

and later:

implicit val jsonFormat: OFormat[AnnotationSource] =
  Json.format[AnnotationSource]   // Option[...] keeps backward compat

If the field must be mandatory, add a DB migration script and bump API version so that older clients fail fast and predictably.
Also double-check an implicit OFormat[ObjectId] is in scope; otherwise compilation will fail here.

🧹 Nitpick comments (2)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala (2)

50-53: Re-evaluate blanket CORS enablement

allowRemoteOrigin = true exposes every endpoint to any origin.
Given that these routes hand out arbitrary binary data, you probably want a narrower CORS policy (e.g. configured origins, credentials flag, etc.) instead of a global wildcard.


320-344: Private-link helpers silently fall back to datastore route

zArrayPrivateLink / zarrJsonPrivateLink default to datastore data when the layer isn’t found in the annotation source.
That makes debugging access mistakes tricky.

Return 404 (or at least log a warning) when the caller explicitly asked for an annotation layer that does not exist instead of silently changing data source.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bdc75cc and 2196b2f.

📒 Files selected for processing (5)
  • app/models/annotation/AnnotationService.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/annotation/AnnotationSource.scala (1 hunks)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (1 hunks)
  • webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes (4 hunks)
✅ Files skipped from review due to trivial changes (1)
  • app/models/annotation/AnnotationService.scala
🚧 Files skipped from review as they are similar to previous changes (2)
  • webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala
  • webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes
🧰 Additional context used
🧬 Code Graph Analysis (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala (15)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/AccessTokenService.scala (4)
  • webknossos (66-68)
  • validateAccessFromTokenContext (83-90)
  • readDataset (42-45)
  • validateAccessFromTokenContextForSyncBlock (77-83)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/dataformats/zarr/ZarrCoordinatesParser.scala (2)
  • ZarrCoordinatesParser (11-47)
  • parseNDimensionalDotCoordinates (13-46)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/Positions.scala (2)
  • VoxelPosition (8-44)
  • topLeft (70-76)
frontend/javascripts/types/api_types.ts (1)
  • AnnotationLayerType (411-411)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/annotation/AnnotationSource.scala (3)
  • AnnotationSource (6-15)
  • AnnotationSource (17-19)
  • getAnnotationLayer (14-14)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/requests/Cuboid.scala (1)
  • Cuboid (9-60)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/requests/DataServiceRequests.scala (1)
  • DataServiceDataRequest (18-30)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (2)
  • getWithLayer (25-29)
  • getById (22-23)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteTracingstoreClient.scala (7)
  • getVolumeLayerAsZarrLayer (40-48)
  • getRawZarrCube (59-63)
  • getZArray (31-35)
  • getZarrJson (35-40)
  • getDataLayerMagDirectoryContents (63-68)
  • getDataLayerDirectoryContents (68-73)
  • getZGroup (73-76)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (1)
  • getAnnotationSource (172-182)
util/src/main/scala/com/scalableminds/util/mvc/ExtendedController.scala (1)
  • tokenContextForRequest (226-228)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/annotation/AnnotationLayer.scala (1)
  • typ (76-79)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/BinaryDataService.scala (1)
  • handleDataRequests (159-178)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/datareaders/zarr/ZarrHeader.scala (1)
  • fromLayer (59-85)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/datareaders/zarr3/Zarr3ArrayHeader.scala (1)
  • fromDataLayer (262-295)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: backend-tests
  • GitHub Check: build-smoketest-push

Comment on lines 154 to 189
private def convertLayerToZarrLayer(layer: DataLayer, zarrVersion: Int): ZarrLayer = {
  val dataFormat = if (zarrVersion == 2) DataFormat.zarr else DataFormat.zarr3
  layer match {
    case s: SegmentationLayer =>
      val rank = s.additionalAxes.map(_.length).getOrElse(0) + 4 // We’re writing c, additionalAxes, xyz
      ZarrSegmentationLayer(
        s.name,
        s.boundingBox,
        s.elementClass,
        mags = s.sortedMags.map(x => MagLocator(x, None, None, Some(AxisOrder.cAdditionalxyz(rank)), None, None)),
        mappings = s.mappings,
        largestSegmentId = s.largestSegmentId,
        numChannels = Some(if (s.elementClass == ElementClass.uint24) 3 else 1),
        defaultViewConfiguration = s.defaultViewConfiguration,
        adminViewConfiguration = s.adminViewConfiguration,
        coordinateTransformations = s.coordinateTransformations,
        additionalAxes = s.additionalAxes.map(reorderAdditionalAxes),
        dataFormat = dataFormat
      )
    case d: DataLayer =>
      val rank = d.additionalAxes.map(_.length).getOrElse(0) + 4 // We’re writing c, additionalAxes, xyz
      ZarrDataLayer(
        d.name,
        d.category,
        d.boundingBox,
        d.elementClass,
        mags = d.sortedMags.map(x => MagLocator(x, None, None, Some(AxisOrder.cAdditionalxyz(rank)), None, None)),
        numChannels = Some(if (d.elementClass == ElementClass.uint24) 3 else 1),
        defaultViewConfiguration = d.defaultViewConfiguration,
        adminViewConfiguration = d.adminViewConfiguration,
        coordinateTransformations = d.coordinateTransformations,
        additionalAxes = d.additionalAxes.map(reorderAdditionalAxes),
        dataFormat = dataFormat
      )
  }
}
Contributor

🛠️ Refactor suggestion

Duplicate layer-conversion code – centralise and test

convertLayerToZarrLayer is another copy of logic already living in the original controller.
Apart from maintenance concerns, the two versions must stay byte-identical to avoid subtle divergences (e.g. rank calculation, element-class channel logic).

Move this transformation into a single helper (maybe inside ZarrLayerFactory) and unit-test it once.

🤖 Prompt for AI Agents
In
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala
between lines 154 and 189, the convertLayerToZarrLayer method duplicates logic
present elsewhere, risking inconsistencies. Refactor by extracting this
conversion logic into a single helper method, ideally within a dedicated object
like ZarrLayerFactory, and replace all existing copies with calls to this
helper. Add unit tests for this helper to ensure correctness and
maintainability.

Comment on lines 34 to 38
/**
*
* This is a copy of the ZarrStreamingController using datasetId instead of organizationId/datasetDirectoryName
*/
class WKZarrStreamingController @Inject()(
Contributor

🛠️ Refactor suggestion

Avoid large-scale copy-paste – extract shared logic

The whole controller is a near verbatim copy of ZarrStreamingController with only the lookup key changed.
This will double maintenance effort and risks the two controllers diverging silently.

Consider extracting the common logic into reusable services / mix-ins and keep the two tiny façade controllers (or, once migration is finished, just one controller).
This also makes upcoming “duplicate all routes” work trivial.

🤖 Prompt for AI Agents
In
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala
around lines 34 to 38, the WKZarrStreamingController is almost an exact copy of
ZarrStreamingController except for the lookup key. To fix this, refactor by
extracting the shared logic into a common service or trait that both controllers
can use, then simplify each controller to just handle the specific lookup key
differences. This reduces duplication, eases maintenance, and prevents
divergence between the two controllers.

Comment on lines 241 to 276
private def rawZarrCube(
    datasetId: String,
    dataLayerName: String,
    mag: String,
    coordinates: String,
)(implicit m: MessagesProvider, tc: TokenContext): Fox[Result] =
  for {
    datasetIdValidated <- ObjectId.fromString(datasetId)
    (dataSource, dataLayer) <- datasetCache.getWithLayer(datasetIdValidated, dataLayerName) ~> SERVICE_UNAVAILABLE
    reorderedAdditionalAxes = dataLayer.additionalAxes.map(reorderAdditionalAxes)
    (x, y, z, additionalCoordinates) <- ZarrCoordinatesParser.parseNDimensionalDotCoordinates(
      coordinates,
      reorderedAdditionalAxes) ?~> "zarr.invalidChunkCoordinates" ~> NOT_FOUND
    magParsed <- Vec3Int
      .fromMagLiteral(mag, allowScalar = true)
      .toFox ?~> Messages("dataLayer.invalidMag", mag) ~> NOT_FOUND
    _ <- Fox.fromBool(dataLayer.containsMag(magParsed)) ?~> Messages("dataLayer.wrongMag", dataLayerName, mag) ~> NOT_FOUND
    cubeSize = DataLayer.bucketLength
    request = DataServiceDataRequest(
      Some(dataSource.id),
      dataLayer,
      Cuboid(
        topLeft = VoxelPosition(x * cubeSize * magParsed.x,
                                y * cubeSize * magParsed.y,
                                z * cubeSize * magParsed.z,
                                magParsed),
        width = cubeSize,
        height = cubeSize,
        depth = cubeSize
      ),
      DataServiceRequestSettings(halfByte = false, additionalCoordinates = additionalCoordinates)
    )
    (data, notFoundIndices) <- binaryDataService.handleDataRequests(List(request))
    _ <- Fox.fromBool(notFoundIndices.isEmpty) ~> "zarr.chunkNotFound" ~> NOT_FOUND
  } yield Ok(data)

Contributor

⚠️ Potential issue

Binary cube response lacks content type & may exhaust memory

rawZarrCube streams the full chunk into a byte array and returns Ok(data) without setting Content-Type.
Two issues:

  1. Clients cannot rely on MIME type – send application/octet-stream.
  2. Holding the whole chunk (~134 MB for 512³ uint8) in memory before sending can blow the heap under load.
    Use Play’s Chunked or HttpEntity.Streamed with an Akka Source to stream from binaryDataService.

Example patch:

-    } yield Ok(data)
+    } yield
+      Ok.sendEntity(
+        HttpEntity.Strict(ByteString(data), Some("application/octet-stream"))
+      )

(Or switch to a streamed entity.)
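For reference, a streamed variant using Play's HttpEntity.Streamed with an Akka Source (a sketch; in practice the Source would ideally deliver chunks from the data service rather than wrapping the already-materialized array):

import akka.stream.scaladsl.Source
import akka.util.ByteString
import play.api.http.HttpEntity

Ok.sendEntity(
  HttpEntity.Streamed(
    Source.single(ByteString(data)), // ideally: a Source of chunks from binaryDataService
    contentLength = Some(data.length.toLong),
    contentType = Some("application/octet-stream")
  )
)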

🤖 Prompt for AI Agents
In
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala
around lines 241 to 276, the rawZarrCube method returns the entire binary chunk
as a byte array with Ok(data) but does not set the Content-Type header, and it
loads the entire chunk into memory which risks heap exhaustion. Fix this by
setting the response Content-Type to "application/octet-stream" and refactor the
method to stream the data using Play's Chunked or HttpEntity.Streamed with an
Akka Source instead of loading it fully into memory before sending.

@frcroth frcroth requested a review from fm3 June 16, 2025 15:08
@frcroth
Contributor Author

frcroth commented Jun 16, 2025

@fm3 I changed it to use an AlfuCache and duplicated the zarr routes as discussed. The other routes are all about files so not relevant to this change (remote datasets).

I also removed the frontend changes. If you want to test them again, see this commit here: f62a7d7#diff-1833767de35826d9954d2ef7d077f41f8b69861833801e63b9dd77e47d4c6d27

Member

@fm3 fm3 left a comment

Nice, I think with the AlfuCache this makes a lot of sense :) Thanks for implementing the smaller comments as well.

I’m a little unsure about the new code duplication, see comment above.

Comment on lines 18 to 20
lazy val cache: AlfuCache[ObjectId, DataSource] = AlfuCache[ObjectId, DataSource](
  timeToLive = 1 day // Cache for a longer time, since we invalidate the cache manually
)
Member

we might want to increase the capacity here, default is 1k. How about 5k? DataSource objects should not use up all that much memory.

Comment on lines 45 to 48
extends Controller
with Zarr3OutputHelper {

override def defaultErrorCode: Int = NOT_FOUND
Member

Whew, that’s really a lot of duplication. @normanrz what do you think, how long after switching to the by-id routes internally do we need to support both variants to facilitate libs compatibility? If it is long, it may be worth taking the time now to extract the duplicated functionality into separate services, or possibly inject the ZarrStreamingController itself into this WkZarrStreamingController and move the shared functionality into functions accessible by both.

Member

@fm3 fm3 Jun 19, 2025

Or maybe we could go the other way round, and have the by-orgaId-and-datasetName routes do a lookup of the id, and then call the methods of the new controllers? If both lookups are cached on the datastore side, there should not be a significant performance hit for that, right? Ok, no, I guess that will only work once all dataset types are accessible in this way, not just remote datasets. So maybe for the next iteration.

@frcroth frcroth requested a review from fm3 June 23, 2025 13:38
@frcroth frcroth merged commit 6096fe4 into master Jun 25, 2025
5 checks passed
@frcroth frcroth deleted the virtual-remote-datasets branch June 25, 2025 07:26
@coderabbitai coderabbitai bot mentioned this pull request Jul 21, 2025
22 tasks
@coderabbitai coderabbitai bot mentioned this pull request Sep 12, 2025
7 tasks
@coderabbitai coderabbitai bot mentioned this pull request Sep 22, 2025
18 tasks