Virtual Remote Datasets #8657
Conversation
📝 Walkthrough

This update introduces enhanced dataset access control, new dataset cache and attachment management, and expanded API endpoints for dataset data access and cache invalidation. It adds support for dataset layer attachments and detailed data source representations, and integrates these features into both the main application and the datastore service, with corresponding updates to public APIs and access token handling.
Actionable comments posted: 7
🧹 Nitpick comments (5)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
778-784: Consider adding error handling for cache update operations.

The method correctly validates write access before updating the cache, but the cache update operation could potentially fail. Consider wrapping the cache update in a Fox for proper error handling.

  def updateCache(datasetId: String): Action[DataSource] = Action.async(validateJson[DataSource]) { implicit request =>
    accessTokenService.validateAccessFromTokenContext(UserAccessRequest.writeDataset(datasetId)) {
-     datasetCache.updateById(datasetId, request.body)
-     Future.successful(Ok)
+     for {
+       _ <- datasetCache.updateById(datasetId, request.body)
+     } yield Ok
    }
  }

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (1)
17-17: Consider adding cache configuration

The TemporaryStore cache has no visible configuration for TTL, size limits, or eviction policies. This could lead to unbounded memory usage in production. Consider adding configuration parameters for:
- Maximum cache size
- Time-to-live (TTL) for cached entries
- Eviction policy when cache is full
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKDatasetController.scala (1)
75-101: Consider using a request object to reduce parameter count.

The method has 10 parameters, which makes it error-prone to call. Consider creating a dedicated request case class to encapsulate these parameters, similar to how RawCuboidRequest is used in the POST variant. Example structure:

case class RawCuboidGetRequest(
  x: Int,
  y: Int,
  z: Int,
  width: Int,
  height: Int,
  depth: Int,
  mag: String,
  halfByte: Boolean = false,
  mappingName: Option[String] = None
)

app/models/dataset/Dataset.scala (2)
851-866: Consider adding error context for axis order parsing failures.

While the current implementation safely handles parsing failures by returning None, it silently discards parsing errors. Consider logging warnings when axis order JSON parsing fails to help with debugging data issues.

- axisOrderParsed = row.axisorder match {
-   case Some(axisOrder) => JsonHelper.parseAs[AxisOrder](axisOrder).toOption
-   case None => None
- }
+ axisOrderParsed = row.axisorder match {
+   case Some(axisOrder) => JsonHelper.parseAs[AxisOrder](axisOrder) match {
+     case Full(parsed) => Some(parsed)
+     case failure =>
+       logger.warn(s"Failed to parse axis order for dataset ${row._Dataset}, layer ${row.datalayername}: $failure")
+       None
+   }
+   case None => None
+ }
1066-1082: Consider using parallel processing for attachment parsing.

The current implementation uses Fox.serialCombined, which processes attachments sequentially. Since these are independent operations, consider using Fox.combined for better performance when dealing with many attachments.

def parseAttachments(rows: List[DatasetLayerAttachmentsRow]): Fox[AttachmentWrapper] =
  for {
-   meshFiles <- Fox.serialCombined(rows.filter(_.`type` == LayerAttachmentType.mesh.toString))(parseRow)
-   agglomerateFiles <- Fox.serialCombined(rows.filter(_.`type` == LayerAttachmentType.agglomerate.toString))(parseRow)
-   connectomeFiles <- Fox.serialCombined(rows.filter(_.`type` == LayerAttachmentType.connectome.toString))(parseRow)
-   segmentIndexFiles <- Fox.serialCombined(rows.filter(_.`type` == LayerAttachmentType.segmentIndex.toString))(parseRow)
-   cumsumFiles <- Fox.serialCombined(rows.filter(_.`type` == LayerAttachmentType.cumsum.toString))(parseRow)
+   meshFiles <- Fox.combined(rows.filter(_.`type` == LayerAttachmentType.mesh.toString).map(parseRow))
+   agglomerateFiles <- Fox.combined(rows.filter(_.`type` == LayerAttachmentType.agglomerate.toString).map(parseRow))
+   connectomeFiles <- Fox.combined(rows.filter(_.`type` == LayerAttachmentType.connectome.toString).map(parseRow))
+   segmentIndexFiles <- Fox.combined(rows.filter(_.`type` == LayerAttachmentType.segmentIndex.toString).map(parseRow))
+   cumsumFiles <- Fox.combined(rows.filter(_.`type` == LayerAttachmentType.cumsum.toString).map(parseRow))
  } yield
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- app/controllers/UserTokenController.scala (2 hunks)
- app/controllers/WKRemoteDataStoreController.scala (1 hunks)
- app/models/dataset/Dataset.scala (10 hunks)
- app/models/dataset/DatasetService.scala (4 hunks)
- app/models/dataset/WKRemoteDataStoreClient.scala (2 hunks)
- conf/webknossos.latest.routes (1 hunks)
- frontend/javascripts/admin/rest_api.ts (1 hunks)
- frontend/javascripts/viewer/model/bucket_data_handling/wkstore_adapter.ts (1 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreModule.scala (1 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/BinaryDataController.scala (1 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (4 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKDatasetController.scala (1 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DatasetLayerAttachments.scala (3 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/AccessTokenService.scala (2 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (2 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (1 hunks)
- webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DatasetLayerAttachments.scala (1)
Learnt from: frcroth
PR: scalableminds/webknossos#8598
File: webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DatasetLayerAttachments.scala:89-95
Timestamp: 2025-06-02T09:49:51.030Z
Learning: In WebKnossos dataset layer attachments, multiple file types can safely use the same directory name (like "agglomerates") because the scanning logic filters by file extension. For example, AgglomerateFileInfo scans for .hdf5 files while CumsumFileInfo scans for .json files in the same "agglomerates" directory without interference.
🧬 Code Graph Analysis (2)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/AccessTokenService.scala (3)
- app/models/job/Job.scala (1): datasetId (55-55)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DataSource.scala (2): DataSourceId (10-12), DataSourceId (14-31)
- app/models/dataset/Dataset.scala (1): dataSourceId (92-92)

app/models/dataset/Dataset.scala (5)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/datareaders/AxisOrder.scala (3): AxisOrder (8-23), AxisOrder (25-46), toString (53-53)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DataLayer.scala (1): DataFormat (30-32)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DatasetLayerAttachments.scala (7): DatasetLayerAttachments (12-21), DatasetLayerAttachments (23-25), LayerAttachment (37-37), LayerAttachment (39-60), LayerAttachmentDataFormat (27-30), LayerAttachmentType (32-35), isEmpty (19-21)
- util/src/main/scala/com/scalableminds/util/tools/JsonHelper.scala (1): JsonHelper (17-112)
- util/src/main/scala/com/scalableminds/util/tools/Fox.scala (1): combined (84-96)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: backend-tests
🔇 Additional comments (25)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/BinaryDataController.scala (1)
41-41: Approve removal of trailing comma.

This is a purely stylistic cleanup in the constructor parameter list and does not affect any logic or behavior.
webknossos-datastore/app/com/scalableminds/webknossos/datastore/DataStoreModule.scala (1)
37-37: LGTM! Proper dependency injection setup.

The DatasetCache binding follows the established pattern and correctly registers as an eager singleton, which is appropriate for a cache service.
conf/webknossos.latest.routes (1)
113-113: LGTM! Well-structured route definition.

The new route follows REST conventions and integrates properly with existing datastore routes. Parameter types correctly match the controller method signature.
app/controllers/WKRemoteDataStoreController.scala (1)
262-271: Verify access control approach for dataset retrieval.

The method uses GlobalAccessContext to retrieve datasets, which bypasses user-specific access controls. Confirm this is intentional for datastore-to-datastore communication. If this endpoint should respect dataset permissions, consider using an appropriate access context:

  def getDataset(name: String, key: String, datasetId: ObjectId): Action[AnyContent] =
    Action.async { implicit request =>
      dataStoreService.validateAccess(name, key) { _ =>
        for {
-         dataset <- datasetDAO.findOne(datasetId)(GlobalAccessContext)
+         // Consider if dataset-level access control is needed here
+         dataset <- datasetDAO.findOne(datasetId)(GlobalAccessContext)
          dataSource <- datasetService.fullDataSourceFor(dataset)
        } yield Ok(Json.toJson(dataSource))
      }
    }

Verify with the team whether datastore authentication is sufficient or if additional dataset-level authorization is required.
app/models/dataset/WKRemoteDataStoreClient.scala (2)
20-20: LGTM!

The import addition is necessary for the new updateDatasetInDSCache method and follows proper import organization.

119-125: LGTM!

The new cache update method is well-implemented:
- Follows the established pattern used by other methods in the class
- Uses appropriate authentication with webknossosToken
- The URL path aligns with the new dataset ID-based routing mentioned in the PR objectives
- Proper return type Fox[Unit] consistent with other cache operations

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (2)
15-15: LGTM!

The expanded import correctly adds the necessary types for the new getDataset method while maintaining clean import organization.

196-199: LGTM!

The new dataset retrieval method is well-implemented:
- Follows established patterns used throughout the class
- Uses proper authentication with dataStoreKey
- The endpoint URL aligns with the new dataset ID-based API design
- Appropriate return type Fox[GenericDataSource[DataLayer]] for dataset operations
- Includes meaningful error message for debugging
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (1)
43-43: LGTM: Clean dependency injection addition.

The DatasetCache dependency is properly injected and will be used by the new updateCache method.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/datasource/DatasetLayerAttachments.scala (2)
19-20: LGTM: Useful utility method with correct implementation.

The isEmpty method properly checks all attachment collections and options, providing a clean way to determine if attachments are present.

27-27: LGTM: Consistent naming improvement.

The renaming from LayerAttachmentDataformat to LayerAttachmentDataFormat improves consistency with proper capitalization conventions. All references have been updated correctly throughout the file.

Also applies to: 37-37, 44-44, 64-64, 72-72, 80-80, 88-88, 96-96
app/controllers/UserTokenController.scala (2)
99-101: LGTM: Proper delegation to dataset access handler.

The pattern match correctly delegates dataset access requests to the new handleDataSetAccess method, using the directoryName as the dataset ID parameter.

164-190: LGTM: Well-structured dataset access control implementation.

The method follows the established pattern for access control:
- Properly converts string ID to ObjectId
- Implements separate read and write access logic
- Uses appropriate database contexts for access validation
- Handles only supported access modes (read/write)

The implementation is consistent with the existing handleDataSourceAccess method.

webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/AccessTokenService.scala (2)
23-23: LGTM: Logical extension of access resource types.

Adding dataset to the enumeration properly extends the access control framework to support dataset-level access validation.

42-43: LGTM: Consistent factory method implementation.

The readDataset and writeDataset factory methods follow the established pattern:
- Use the datasetId as the directoryName in DataSourceId
- Empty organizationId (consistent with other ID-based access requests)
- Proper access modes for the respective operations
This maintains consistency with existing access request patterns.
Also applies to: 48-49
webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes (1)
15-20: Well-structured dataset-centric routes

The new /wkDatasets/ routes provide a clean API for accessing datasets by ID. The distinct prefix prevents ambiguity with existing organization-based routes, and the consistent structure with existing endpoints makes the API intuitive.

Also applies to: 127-127
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKDatasetController.scala (6)
53-73: LGTM! Well-structured endpoint with proper access control and monitoring.

The method correctly validates access, handles errors gracefully, and includes performance monitoring for slow requests.

103-112: Clean implementation using request object.

This POST variant properly uses a request object pattern, which is cleaner than the GET variant with many parameters.

114-164: Robust thumbnail generation with good error handling.

The implementation includes proper fallback handling for empty data (lines 155-157) and comprehensive parameter validation. The JPEG generation pipeline is well-structured.

166-175: LGTM!

Simple and effective implementation for finding data positions.

177-188: Well-implemented with informative error messages.

The method provides clear error messages for both missing data sources and histogram generation failures.

190-198: LGTM!

Clean helper method that properly transforms requests and delegates to the binary data service.
app/models/dataset/Dataset.scala (3)
868-873: LGTM! Clear implementation with helpful context.

The method correctly retrieves all magnifications and the comment about WKW datasets provides important context for maintainers.

886-956: Well-integrated attachment support for dataset layers.

The implementation correctly extends layer parsing to include attachments, data format, and channel count. Error messages are descriptive and will help with debugging.

1092-1118: LGTM! Clean implementation of attachment updates.

The method properly handles clearing old attachments before inserting new ones within a transaction, ensuring data consistency.
private def findLayersForDatasetWithMags(datasetId: ObjectId): Fox[List[DataLayer]] =
  for {
    layers <- datasetDataLayerDAO.findAllForDataset(datasetId)
    _ <- Fox.fromBool(!layers.flatMap(_.dataFormatOpt).contains(DataFormat.wkw)) ?~> "WKW data format not supported in this context, only datasets with MagLocators are supported"
    layerNamesAndMags <- datasetMagsDAO.findAllByDatasetId(datasetId)
    layersWithMags = layers.map { layer =>
      val mags = layerNamesAndMags.filter(_._1 == layer.name).map(_._2).toList
      layer match {
        case AbstractDataLayer(name, category, boundingBox, _, elementClass, defaultViewConfiguration,
                               adminViewConfiguration, coordinateTransformations, additionalAxes, attachmentsOpt,
                               _, numChannels, dataFormat, _) =>
          dataFormat match {
            case Some(df) =>
              df match {
                case DataFormat.wkw =>
                  throw new NotImplementedError(
                    "WKW data format not supported in this context, only datasets with MagLocators are supported")
                case DataFormat.neuroglancerPrecomputed =>
                  PrecomputedDataLayer(name, boundingBox, category, elementClass, mags, defaultViewConfiguration,
                                       adminViewConfiguration, coordinateTransformations, numChannels, additionalAxes,
                                       attachmentsOpt)
                case DataFormat.n5 =>
                  N5DataLayer(name, category, boundingBox, elementClass, mags, defaultViewConfiguration,
                              adminViewConfiguration, coordinateTransformations, numChannels, additionalAxes,
                              attachmentsOpt)
                case DataFormat.zarr =>
                  ZarrDataLayer(name, category, boundingBox, elementClass, mags, defaultViewConfiguration,
                                adminViewConfiguration, coordinateTransformations, numChannels, additionalAxes,
                                attachmentsOpt, df)
                case DataFormat.zarr3 =>
                  Zarr3DataLayer(name, category, boundingBox, elementClass, mags, defaultViewConfiguration,
                                 adminViewConfiguration, coordinateTransformations, numChannels, additionalAxes,
                                 attachmentsOpt)
              }
            case None => ???
          }
        case AbstractSegmentationLayer(name, _, boundingBox, _, elementClass, largestSegmentId, mappings,
                                       defaultViewConfiguration, adminViewConfiguration, coordinateTransformations,
                                       additionalAxes, attachmentsOpt, _, numChannels, dataFormat, _) =>
          dataFormat match {
            case Some(df) =>
              df match {
                case DataFormat.wkw =>
                  throw new NotImplementedError(
                    "WKW data format not supported in this context, only datasets with MagLocators are supported")
                case DataFormat.neuroglancerPrecomputed =>
                  PrecomputedSegmentationLayer(name, boundingBox, elementClass, mags, largestSegmentId, mappings,
                                               defaultViewConfiguration, adminViewConfiguration,
                                               coordinateTransformations, numChannels, additionalAxes, attachmentsOpt)
                case DataFormat.n5 =>
                  N5SegmentationLayer(name, boundingBox, elementClass, mags, largestSegmentId, mappings,
                                      defaultViewConfiguration, adminViewConfiguration, coordinateTransformations,
                                      numChannels, additionalAxes, attachmentsOpt)
                case DataFormat.zarr =>
                  ZarrSegmentationLayer(name, boundingBox, elementClass, mags, largestSegmentId, mappings,
                                        defaultViewConfiguration, adminViewConfiguration, coordinateTransformations,
                                        numChannels, additionalAxes, attachmentsOpt, df)
                case DataFormat.zarr3 =>
                  Zarr3SegmentationLayer(name, boundingBox, elementClass, mags, largestSegmentId, mappings,
                                         defaultViewConfiguration, adminViewConfiguration, coordinateTransformations,
                                         numChannels, additionalAxes, attachmentsOpt)
              }
            case None => ???
          }
        case _ => throw new NotImplementedError("DataLayer type mismatch (unreachable)")
      }
    }
  } yield layersWithMags
🛠️ Refactor suggestion
Refactor duplicated pattern matching logic
This method contains extensive code duplication between the AbstractDataLayer and AbstractSegmentationLayer cases. Each data format is handled almost identically for both layer types.
Consider extracting the common logic into helper methods:
private def createDataLayerForFormat(
dataFormat: DataFormat,
name: String,
boundingBox: BoundingBox,
elementClass: ElementClass,
mags: List[MagInfo],
// ... other common parameters
): DataLayer = dataFormat match {
case DataFormat.neuroglancerPrecomputed =>
PrecomputedDataLayer(name, boundingBox, ...)
case DataFormat.n5 =>
N5DataLayer(name, ...)
// ... other formats
}
private def createSegmentationLayerForFormat(
dataFormat: DataFormat,
// ... parameters
): DataLayer = dataFormat match {
// Similar pattern for segmentation layers
}
🤖 Prompt for AI Agents
In app/models/dataset/DatasetService.scala between lines 306 and 483, the
pattern matching logic for handling data formats in AbstractDataLayer and
AbstractSegmentationLayer cases is duplicated. To fix this, extract the common
logic for creating data layers and segmentation layers into separate helper
methods that take the data format and relevant parameters, then return the
appropriate DataLayer instance. Replace the duplicated match blocks with calls
to these helper methods to improve code maintainability and reduce redundancy.
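For illustration, the whole per-format cascade could collapse into a single dispatcher that both layer branches call with their four constructors. This is only a sketch with simplified signatures (the real parameter lists live in DataLayer.scala), reusing the Box types already present in the codebase:

// Illustrative sketch only – not taken from the PR.
private def dispatchByFormat[A](dataFormat: DataFormat)(precomputed: => A, n5: => A, zarr: => A, zarr3: => A): Box[A] =
  dataFormat match {
    case DataFormat.neuroglancerPrecomputed => Full(precomputed)
    case DataFormat.n5                      => Full(n5)
    case DataFormat.zarr                    => Full(zarr)
    case DataFormat.zarr3                   => Full(zarr3)
    case DataFormat.wkw                     => Failure("WKW data format not supported in this context")
  }

The colour branch would pass PrecomputedDataLayer(...), N5DataLayer(...), etc., and the segmentation branch the corresponding segmentation constructors, so the DataFormat match exists exactly once.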
private def findLayersForDatasetWithMags(datasetId: ObjectId): Fox[List[DataLayer]] =
  for {
    layers <- datasetDataLayerDAO.findAllForDataset(datasetId)
    _ <- Fox.fromBool(!layers.flatMap(_.dataFormatOpt).contains(DataFormat.wkw)) ?~> "WKW data format not supported in this context, only datasets with MagLocators are supported"
Inconsistent error handling for WKW format
The method uses Fox.fromBool to check for WKW format at line 309, but then throws NotImplementedError in the pattern matching. This creates two different error paths for the same condition.
Either rely on the initial check:
-case DataFormat.wkw =>
- throw new NotImplementedError(
- "WKW data format not supported in this context, only datasets with MagLocators are supported")
+case DataFormat.wkw =>
+ // This case should be unreachable due to the check at line 309
+ Fox.failure("Internal error: WKW format should have been filtered out")
Or remove the initial check and handle it consistently in the pattern matching.
Also applies to: 331-333, 413-415
🤖 Prompt for AI Agents
In app/models/dataset/DatasetService.scala at line 309 and similarly at lines
331-333 and 413-415, the error handling for the WKW data format is inconsistent
because it uses Fox.fromBool for an initial check but then throws
NotImplementedError later. To fix this, choose one consistent approach: either
keep the Fox.fromBool check and remove the NotImplementedError throws in the
pattern matching, or remove the Fox.fromBool check and handle the error
uniformly with NotImplementedError in the pattern matching. Apply the chosen
approach consistently across all mentioned lines.
Thanks for delving into this, good stuff!
I left a couple of comments, mostly on caching.
Testing did not go all that well for me yet, maybe I’m doing some things wrong. I switched the dataset id in the ts files for one that exists in my db and tried to open a dataset, but I only get black.
In the backend logging I see a lot of failed RPCs with 403 "DataStore not found."
as well as some failed PUT requests with {"obj.scale":[{"msg":["error.path.missing"],"args":[]}],"obj.dataLayers":[{"msg":["error.path.missing"],"args":[]}]}
If everything works for you, maybe we could debug on my machine together on monday.
I agree that we should attempt to minimize code duplication between the implementation of the datastore routes by id and those by organizationId+datasetDirectoryName. I hope we can get rid of the old access pattern at some point, but that may take several months.
case Some(df) =>
  df match {
    case DataFormat.wkw =>
      throw new NotImplementedError(
please don’t mix exceptions with box/fox error handling. If it is too complex to return Box in the match cascades, please at least wrap it with tryo so the exceptions are ultimately turned into boxes so they can bubble up properly.
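If extracting Boxes is too invasive, the tryo wrapper could look roughly like this (a sketch only; convertLayer stands in for the extracted match cascade and is not an existing method):

// any NotImplementedError thrown inside the cascade becomes a Failure box and bubbles up through the Fox chain
layersWithMags <- Fox.serialCombined(layers) { layer =>
  tryo(convertLayer(layer, layerNamesAndMags)).toFox
}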
object AccessResourceType extends ExtendedEnumeration {
  type AccessResourceType = Value
- val datasource, tracing, annotation, webknossos, jobExport = Value
+ val datasource, dataset, tracing, annotation, webknossos, jobExport = Value
hmm again naming things is hard 😬 I don’t really have a clean solution either. I hope we can just get rid of the traditional/datasource codepaths soon.
    @Named("webknossos-datastore") val actorSystem: ActorSystem)(implicit ec: ExecutionContext)
    extends FoxImplicits {

  val cache = new TemporaryStore[ObjectId, DataSource](actorSystem)
Implementing this with a TemporaryStore and explicit get and insert solves this. However, if many requests pour in that trigger this codepath, they will all be cache misses at first, so a lot of requests to the remote webknossos will be caused.
Instead, how about using the AlfuCache pattern, which is actually asynchronous and caches the Futures?
This would, however, require a few changes. The AlfuCache Pattern is getOrLoad, so there is no explicit insert.
But I think this could work. Instead of explicit inserts, you could just clear the existing entries to invalidate them. Then the next getOrLoad will re-fill the cache with the updated version. What do you think?
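Roughly what I have in mind (a sketch only, imports omitted; the getOrLoad/remove method names and the getDataset signature are assumed from existing usage, not copied from this PR):

class DatasetCache @Inject()(remoteWebknossosClient: DSRemoteWebknossosClient)(implicit ec: ExecutionContext) {

  // the loader Future is cached, so concurrent misses for the same id trigger only one remote request
  private lazy val cache: AlfuCache[ObjectId, DataSource] = AlfuCache(timeToLive = 1 day)

  def getById(datasetId: ObjectId)(implicit tc: TokenContext): Fox[DataSource] =
    cache.getOrLoad(datasetId, id => remoteWebknossosClient.getDataset(id.toString))

  // no explicit insert: invalidate the entry and let the next getOrLoad re-fetch the updated version
  def invalidate(datasetId: ObjectId): Unit = cache.remove(datasetId)
}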
I talked to norman again and yes, we want to eventually only use the byId routes in the datastore. So the next steps (possibly in a follow-up PR) would be to duplicate all relevant datastore routes, and implement this for all dataset types, as well as switch the frontend to use only the new routes.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
app/models/dataset/Dataset.scala (1)
890-905: 🛠️ Refactor suggestion

N+1 query pattern when fetching attachments

Inside parseRow we call datasetLayerAttachmentsDAO.findAll… for every layer. For datasets with many layers this spawns one extra DB round-trip per layer. Consider bulk-loading all attachments once and grouping them in memory, or extending the original layer query with a LEFT JOIN to fetch attachments in a single shot.
♻️ Duplicate comments (2)
app/models/dataset/DatasetService.scala (2)
309-336: 🛠️ Refactor suggestion

Inconsistent WKW-format handling persists

Past review already flagged that we both (a) pre-filter WKW via Fox.fromBool and (b) still throw a NotImplementedError in the pattern-match. Having two independent guards for the same condition is brittle and violates the “single source of truth” principle. Either drop the Fox.fromBool guard or replace the NotImplementedError with a Fox.failure, but use exactly one style consistently.

315-392: 🛠️ Refactor suggestion

Large duplication between colour & segmentation logic

The two pattern-matches are ~100 lines each and differ only in the concrete layer constructor. This was pointed out earlier and is still unresolved. Please extract a helper that maps (df, layerType) → DataLayer to eliminate duplication and ease future maintenance.

Also applies to: 448-478
🧹 Nitpick comments (3)
app/models/dataset/DatasetService.scala (1)
488-493: Network failure from datastore notification is unchecked

updateDatasetInDSCache performs an HTTP request but its failure is ignored. Consider wrapping the call in Fox.tryo and bubbling up / logging the error so that silent cache desynchronisation does not occur.

app/models/dataset/Dataset.scala (2)

904-905: Minor: redundant lambda

row.dataformat.flatMap(df => DataFormat.fromString(df)) can be shortened to row.dataformat.flatMap(DataFormat.fromString)

1066-1083: parseAttachments repeatedly filters the same list

Each attachment type filters rows anew, producing five passes over the same collection. A quick micro-refactor:

- meshFiles <- Fox.serialCombined(rows.filter(_.`type` == ...))(parseRow)
+ val grouped = rows.groupBy(_.`type`)
+ meshFiles <- Fox.serialCombined(grouped.getOrElse(LayerAttachmentType.mesh.toString, Nil))(parseRow)

reduces complexity with no behaviour change.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- app/controllers/WKRemoteDataStoreController.scala (1 hunks)
- app/models/dataset/Dataset.scala (10 hunks)
- app/models/dataset/DatasetService.scala (5 hunks)
- app/models/dataset/WKRemoteDataStoreClient.scala (2 hunks)
- conf/webknossos.latest.routes (1 hunks)
- frontend/javascripts/admin/rest_api.ts (1 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala (4 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (2 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (1 hunks)
- webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- conf/webknossos.latest.routes
- app/controllers/WKRemoteDataStoreController.scala
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/DataSourceController.scala
- app/models/dataset/WKRemoteDataStoreClient.scala
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala
- frontend/javascripts/admin/rest_api.ts
- webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-smoketest-push
- GitHub Check: backend-tests
Actionable comments posted: 1
♻️ Duplicate comments (2)
app/models/dataset/DatasetService.scala (2)
307-488: 🛠️ Refactor suggestion

Large format-switch duplication persists

The lengthy match cascades for data/segmentation layers replicate the logic earlier flagged as duplicate. Extracting factory helpers (as already proposed in a previous review) would remove this redundancy and simplify future maintenance.

310-336: 🛠️ Refactor suggestion

Redundant WKW checks

A top-level Fox.fromBool already rejects DataFormat.wkw; keeping an inner case DataFormat.wkw => throw … branch is superfluous and adds noise.

Also applies to: 415-418
🧹 Nitpick comments (2)
app/models/dataset/DatasetService.scala (2)
293-306: fullDataSourceFor largely copies dataSourceFor – DRY it up

Both methods are identical except for the layer-retrieval call. Extract the common part into a small helper:

private def buildSource(dataset: Dataset, layersF: => Fox[List[DataLayer]]): Fox[InboxDataSource] =
  for {
    layers <- layersF
    id = DataSourceId(dataset.directoryName, dataset._organization)
    scale <- dataset.voxelSize.toFox ?~> "dataset.source.usableButNoScale"
  } yield GenericDataSource[DataLayer](id, layers, scale)

def dataSourceFor(dataset: Dataset) = buildSource(dataset, datasetDataLayerDAO.findAllForDataset(dataset._id))

def fullDataSourceFor(dataset: Dataset) = buildSource(dataset, findLayersForDatasetWithMags(dataset._id))

Keeps behaviour identical while cutting duplication.

394-395: Return a Fox.failure instead of throwing

Throwing IllegalStateException here is caught by tryo, but causes verbose stack-traces and loses the chance to provide an i18n key. Prefer:

case None => Fox.failure("dataset.layer.missingDataFormat")

for clearer error semantics and cleaner logs.
Also applies to: 480-482
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- app/models/dataset/DatasetService.scala (5 hunks)
- app/models/dataset/WKRemoteDataStoreClient.scala (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- app/models/dataset/WKRemoteDataStoreClient.scala
🧰 Additional context used
🧠 Learnings (1)
app/models/dataset/DatasetService.scala (1)
Learnt from: frcroth
PR: scalableminds/webknossos#8657
File: app/models/dataset/DatasetService.scala:380-392
Timestamp: 2025-06-16T09:04:40.565Z
Learning: Trailing commas are supported in Scala starting from version 2.12.2 as part of SIP-27. They are valid in function calls, function definitions, import statements, and other contexts. This means trailing commas after the last parameter or argument do not cause compilation errors in Scala 2.12.2+.
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-smoketest-push
- GitHub Check: backend-tests
      _ <- notifyDatastoreOnUpdate(foundDataset._id)
    } yield foundDataset._id
💡 Verification agent
🧩 Analysis chain
Remote cache invalidation can make the entire update fail
notifyDatastoreOnUpdate is executed inside the transactional flow of updateDataSource*. A transient network error when calling the remote datastore would roll the whole DB update back, although only the cache could not be refreshed.

Consider decoupling this call (e.g. run it in a Future/fork or dispatch a message that can be retried) so the primary write succeeds even if the datastore is unreachable, and log failures for later inspection.
Also applies to: 238-239
I’ll examine how notifyDatastoreOnUpdate is invoked relative to the transaction boundaries. Retrieving the updateDataSource* definitions, their transaction usage, and the call sites:
🏁 Script executed:
#!/bin/bash
set -e
# Locate updateDataSource methods
echo "=== updateDataSource definitions ==="
rg -n "def updateDataSource" app/models/dataset/DatasetService.scala
# Show surrounding lines of these methods
echo -e "\n=== Context around updateDataSource ==="
rg -n -C5 "updateDataSource" app/models/dataset/DatasetService.scala
# Locate notifyDatastoreOnUpdate calls
echo -e "\n=== notifyDatastoreOnUpdate usages ==="
rg -n -C3 "notifyDatastoreOnUpdate" app/models/dataset/DatasetService.scala
# Check for transaction boundaries (e.g., .transactionally or similar)
echo -e "\n=== Transaction markers ==="
rg -n "transactionally" app/models/dataset/DatasetService.scala || echo "No explicit .transactionally found"
Length of output: 6057
Decouple notifyDatastoreOnUpdate from the transactional flow

The call to notifyDatastoreOnUpdate(foundDataset._id) is currently inside the same for-comprehension (and thus the same transactional/monadic flow) as your database updates. Any transient failure contacting the remote datastore will cause the entire DB update to roll back. To ensure the primary write always succeeds:
• Move the cache-invalidation call outside the DB transaction or wrap it in a detached Future/background job.
• Log and retry failures independently of the DB update.
Locations to update:
- app/models/dataset/DatasetService.scala: lines 210–217
- app/models/dataset/DatasetService.scala: lines 232–239
🤖 Prompt for AI Agents
In app/models/dataset/DatasetService.scala around lines 210 to 217 and 232 to
239, the call to notifyDatastoreOnUpdate is currently inside the transactional
flow, causing the entire database update to roll back if the remote cache
invalidation fails. To fix this, move the notifyDatastoreOnUpdate call outside
the transaction or run it asynchronously in a detached Future or background job,
ensuring the primary database update always succeeds. Additionally, add logging
and retry mechanisms for failures in cache invalidation to handle transient
errors without affecting the main transaction.
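A minimal sketch of the detached-notification pattern (plain Futures for brevity; the Fox-based variant in this codebase would differ in the details):

import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}

// The primary result is returned as soon as the write completes; the notification runs on its own
// and only logs on failure, so it can never roll back the DB update.
def updateAndNotify[A](primaryWrite: Future[A], notify: () => Future[Unit])(implicit ec: ExecutionContext): Future[A] =
  primaryWrite.map { result =>
    notify().onComplete {
      case Success(_) => ()
      case Failure(e) => println(s"Datastore cache invalidation failed: ${e.getMessage}") // logging + retry could go here
    }
    result
  }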
Actionable comments posted: 3
🔭 Outside diff range comments (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/annotation/AnnotationSource.scala (1)
6-13
: New mandatorydatasetId
will break deserialization of existing documents
datasetId
is introduced as a requiredObjectId
.
All persistedAnnotationSource
JSON documents that were written before this PR (or any client payloads not yet updated) lack this field, so:
Json.parse(..).as[AnnotationSource]
will now throwJsResultException
.- Play will return
400
for incoming requests that omit the property.Unless a full data-migration + coordinated client rollout is guaranteed, make the field optional or supply a custom
Reads
with a sensible fallback.- datasetId: ObjectId, + datasetId: Option[ObjectId],and later:
implicit val jsonFormat: OFormat[AnnotationSource] = Json.format[AnnotationSource] // Option[...] keeps backward compatIf the field must be mandatory, add a DB migration script and bump API version so that older clients fail fast and predictably.
Also double-check an implicitOFormat[ObjectId]
is in scope; otherwise compilation will fail here.
🧹 Nitpick comments (2)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala (2)
50-53
: Re-evaluate blanket CORS enablement
allowRemoteOrigin = true
exposes every endpoint to any origin.
Given that these routes hand out arbitrary binary data, you probably want a narrower CORS policy (e.g. configured origins, credentials flag, etc.) instead of a global wildcard.
320-344
: Private-link helpers silently fall back to datastore route
zArrayPrivateLink
/zarrJsonPrivateLink
default to datastore data when the layer isn’t found in the annotation source.
That makes debugging access mistakes tricky.Return 404 (or at least log a warning) when the caller explicitly asked for an annotation layer that does not exist instead of silently changing data source.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- app/models/annotation/AnnotationService.scala (1 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala (1 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/annotation/AnnotationSource.scala (1 hunks)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (1 hunks)
- webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes (4 hunks)
✅ Files skipped from review due to trivial changes (1)
- app/models/annotation/AnnotationService.scala
🚧 Files skipped from review as they are similar to previous changes (2)
- webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala
- webknossos-datastore/conf/com.scalableminds.webknossos.datastore.routes
🧰 Additional context used
🧬 Code Graph Analysis (1)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala (15)
webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/AccessTokenService.scala (4)
webknossos
(66-68)validateAccessFromTokenContext
(83-90)readDataset
(42-45)validateAccessFromTokenContextForSyncBlock
(77-83)webknossos-datastore/app/com/scalableminds/webknossos/datastore/dataformats/zarr/ZarrCoordinatesParser.scala (2)
ZarrCoordinatesParser
(11-47)parseNDimensionalDotCoordinates
(13-46)webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/Positions.scala (2)
VoxelPosition
(8-44)topLeft
(70-76)frontend/javascripts/types/api_types.ts (1)
AnnotationLayerType
(411-411)webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/annotation/AnnotationSource.scala (3)
AnnotationSource
(6-15)AnnotationSource
(17-19)getAnnotationLayer
(14-14)webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/requests/Cuboid.scala (1)
Cuboid
(9-60)webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/requests/DataServiceRequests.scala (1)
DataServiceDataRequest
(18-30)webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DatasetCache.scala (2)
getWithLayer
(25-29)getById
(22-23)webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteTracingstoreClient.scala (7)
getVolumeLayerAsZarrLayer
(40-48)getRawZarrCube
(59-63)getZArray
(31-35)getZarrJson
(35-40)getDataLayerMagDirectoryContents
(63-68)getDataLayerDirectoryContents
(68-73)getZGroup
(73-76)webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/DSRemoteWebknossosClient.scala (1)
getAnnotationSource
(172-182)util/src/main/scala/com/scalableminds/util/mvc/ExtendedController.scala (1)
tokenContextForRequest
(226-228)webknossos-datastore/app/com/scalableminds/webknossos/datastore/models/annotation/AnnotationLayer.scala (1)
typ
(76-79)webknossos-datastore/app/com/scalableminds/webknossos/datastore/services/BinaryDataService.scala (1)
handleDataRequests
(159-178)webknossos-datastore/app/com/scalableminds/webknossos/datastore/datareaders/zarr/ZarrHeader.scala (1)
fromLayer
(59-85)webknossos-datastore/app/com/scalableminds/webknossos/datastore/datareaders/zarr3/Zarr3ArrayHeader.scala (1)
fromDataLayer
(262-295)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: backend-tests
- GitHub Check: build-smoketest-push
private def convertLayerToZarrLayer(layer: DataLayer, zarrVersion: Int): ZarrLayer = {
  val dataFormat = if (zarrVersion == 2) DataFormat.zarr else DataFormat.zarr3
  layer match {
    case s: SegmentationLayer =>
      val rank = s.additionalAxes.map(_.length).getOrElse(0) + 4 // We’re writing c, additionalAxes, xyz
      ZarrSegmentationLayer(
        s.name,
        s.boundingBox,
        s.elementClass,
        mags = s.sortedMags.map(x => MagLocator(x, None, None, Some(AxisOrder.cAdditionalxyz(rank)), None, None)),
        mappings = s.mappings,
        largestSegmentId = s.largestSegmentId,
        numChannels = Some(if (s.elementClass == ElementClass.uint24) 3 else 1),
        defaultViewConfiguration = s.defaultViewConfiguration,
        adminViewConfiguration = s.adminViewConfiguration,
        coordinateTransformations = s.coordinateTransformations,
        additionalAxes = s.additionalAxes.map(reorderAdditionalAxes),
        dataFormat = dataFormat
      )
    case d: DataLayer =>
      val rank = d.additionalAxes.map(_.length).getOrElse(0) + 4 // We’re writing c, additionalAxes, xyz
      ZarrDataLayer(
        d.name,
        d.category,
        d.boundingBox,
        d.elementClass,
        mags = d.sortedMags.map(x => MagLocator(x, None, None, Some(AxisOrder.cAdditionalxyz(rank)), None, None)),
        numChannels = Some(if (d.elementClass == ElementClass.uint24) 3 else 1),
        defaultViewConfiguration = d.defaultViewConfiguration,
        adminViewConfiguration = d.adminViewConfiguration,
        coordinateTransformations = d.coordinateTransformations,
        additionalAxes = d.additionalAxes.map(reorderAdditionalAxes),
        dataFormat = dataFormat
      )
  }
}
🛠️ Refactor suggestion
Duplicate layer-conversion code – centralise and test
convertLayerToZarrLayer
is another copy of logic already living in the original controller.
Apart from maintenance concerns, the two versions must stay byte-identical to avoid subtle divergences (e.g. rank calculation, element-class channel logic).
Move this transformation into a single helper (maybe inside ZarrLayerFactory
) and unit-test it once.
🤖 Prompt for AI Agents
In
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala
between lines 154 and 189, the convertLayerToZarrLayer method duplicates logic
present elsewhere, risking inconsistencies. Refactor by extracting this
conversion logic into a single helper method, ideally within a dedicated object
like ZarrLayerFactory, and replace all existing copies with calls to this
helper. Add unit tests for this helper to ensure correctness and
maintainability.
/** | ||
* | ||
* This is a copy of the ZarrStreamingController using datasetId instead of organizationId/datasetDirectoryName | ||
*/ | ||
class WKZarrStreamingController @Inject()( |
🛠️ Refactor suggestion
Avoid large-scale copy-paste – extract shared logic
The whole controller is a near verbatim copy of ZarrStreamingController
with only the lookup key changed.
This will double maintenance effort and risks the two controllers diverging silently.
Consider extracting the common logic into reusable services / mix-ins and keep the two tiny façade controllers (or, once migration is finished, just one controller).
This also makes upcoming “duplicate all routes” work trivial.
🤖 Prompt for AI Agents
In
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala
around lines 34 to 38, the WKZarrStreamingController is almost an exact copy of
ZarrStreamingController except for the lookup key. To fix this, refactor by
extracting the shared logic into a common service or trait that both controllers
can use, then simplify each controller to just handle the specific lookup key
differences. This reduces duplication, eases maintenance, and prevents
divergence between the two controllers.
private def rawZarrCube(
    datasetId: String,
    dataLayerName: String,
    mag: String,
    coordinates: String,
)(implicit m: MessagesProvider, tc: TokenContext): Fox[Result] =
  for {
    datasetIdValidated <- ObjectId.fromString(datasetId)
    (dataSource, dataLayer) <- datasetCache.getWithLayer(datasetIdValidated, dataLayerName) ~> SERVICE_UNAVAILABLE
    reorderedAdditionalAxes = dataLayer.additionalAxes.map(reorderAdditionalAxes)
    (x, y, z, additionalCoordinates) <- ZarrCoordinatesParser.parseNDimensionalDotCoordinates(
      coordinates,
      reorderedAdditionalAxes) ?~> "zarr.invalidChunkCoordinates" ~> NOT_FOUND
    magParsed <- Vec3Int
      .fromMagLiteral(mag, allowScalar = true)
      .toFox ?~> Messages("dataLayer.invalidMag", mag) ~> NOT_FOUND
    _ <- Fox.fromBool(dataLayer.containsMag(magParsed)) ?~> Messages("dataLayer.wrongMag", dataLayerName, mag) ~> NOT_FOUND
    cubeSize = DataLayer.bucketLength
    request = DataServiceDataRequest(
      Some(dataSource.id),
      dataLayer,
      Cuboid(
        topLeft = VoxelPosition(x * cubeSize * magParsed.x,
                                y * cubeSize * magParsed.y,
                                z * cubeSize * magParsed.z,
                                magParsed),
        width = cubeSize,
        height = cubeSize,
        depth = cubeSize
      ),
      DataServiceRequestSettings(halfByte = false, additionalCoordinates = additionalCoordinates)
    )
    (data, notFoundIndices) <- binaryDataService.handleDataRequests(List(request))
    _ <- Fox.fromBool(notFoundIndices.isEmpty) ~> "zarr.chunkNotFound" ~> NOT_FOUND
  } yield Ok(data)
|
Binary cube response lacks content type & may exhaust memory
rawZarrCube
streams the full chunk into a byte array and returns Ok(data)
without setting Content-Type
.
Two issues:
- Clients cannot rely on MIME type – send
application/octet-stream
. - Holding the whole chunk (~134 MB for 512³ uint8) in memory before sending can blow the heap under load.
Use Play’sChunked
orHttpEntity.Streamed
with an Akka Source to stream frombinaryDataService
.
Example patch:
- } yield Ok(data)
+ } yield
+ Ok.sendEntity(
+ HttpEntity.Strict(ByteString(data), Some("application/octet-stream"))
+ )
(Or switch to a streamed entity.)
🤖 Prompt for AI Agents
In
webknossos-datastore/app/com/scalableminds/webknossos/datastore/controllers/WKZarrStreamingController.scala
around lines 241 to 276, the rawZarrCube method returns the entire binary chunk
as a byte array with Ok(data) but does not set the Content-Type header, and it
loads the entire chunk into memory which risks heap exhaustion. Fix this by
setting the response Content-Type to "application/octet-stream" and refactor the
method to stream the data using Play's Chunked or HttpEntity.Streamed with an
Akka Source instead of loading it fully into memory before sending.
@fm3 I changed it to use an AlfuCache and duplicated the zarr routes as discussed. The other routes are all about files so not relevant to this change (remote datasets). I also removed the frontend changes. If you want to test them again, see this commit here: f62a7d7#diff-1833767de35826d9954d2ef7d077f41f8b69861833801e63b9dd77e47d4c6d27
Nice, I think with the AlfuCache this makes a lot of sense :) Thanks for implementing the smaller comments as well.
I’m a little unsure about the new code duplication, see comment above.
lazy val cache: AlfuCache[ObjectId, DataSource] = AlfuCache[ObjectId, DataSource](
  timeToLive = 1 day // Cache for a longer time, since we invalidate the cache manually
)
we might want to increase the capacity here, default is 1k. How about 5k? DataSource objects should not use up all that much memory.
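For reference, roughly this (the maxCapacity parameter name is an assumption based on the AlfuCache defaults, please double-check):

lazy val cache: AlfuCache[ObjectId, DataSource] = AlfuCache[ObjectId, DataSource](
  maxCapacity = 5000, // assumed parameter name; default is 1000
  timeToLive = 1 day  // entries are invalidated explicitly, so a long TTL is fine
)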
    extends Controller
    with Zarr3OutputHelper {

  override def defaultErrorCode: Int = NOT_FOUND
Whew, that’s really a lot of duplication. @normanrz what do you think, how long after switching to the by-id routes internally do we need to support both variants to facilitate libs compatibility? If it is long, it may be worth taking the time now to extract the duplicated functionality into separate services, or possibly inject the ZarrStreamingController itself into this WkZarrStreamingController and move the shared functionality into functions accessible by both.
Or maybe we could go the other way round, and have the by-orgaId-and-datasetName routes do a lookup of the id, and then call the methods of the new controllers? If both lookups are cached on the datastore side, there should not be a significant performance hit for that, right? ok no I guess that will work only once all dataset types are accessible in this way, not just remote datasets. So maybe for the next iteration
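Roughly, the delegation could look like this (a sketch only – the action name, the id-lookup cache, and the Fox/Future glue are placeholders, not existing code):

// legacy route: resolve the dataset id once (cached), then delegate to the id-based action
def legacyRequestRawCuboid(organizationId: String, datasetDirectoryName: String, dataLayerName: String): Action[RawCuboidRequest] =
  Action.async(validateJson[RawCuboidRequest]) { implicit request =>
    datasetIdCache
      .getOrLoad((organizationId, datasetDirectoryName), keys => lookupDatasetId(keys._1, keys._2))
      .futureBox
      .flatMap {
        case Full(datasetId) => wkDatasetController.requestRawCuboid(datasetId.toString, dataLayerName)(request)
        case _               => Future.successful(NotFound)
      }
  }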
URL of deployed dev instance (used for testing):
This PR adds routes in the datastore to access datasets by ID instead of by organizationId and directory name. Data sources are cached in the datastore; the configuration is generated by WK from the DB, and WK can invalidate the DS cache when the dataset is changed. This PR only supports remote datasets, and in particular no WKW datasets.
This adds new routes with the "wkDataset" path, since the dataset routes directly follow with the organizationId, making routes that use the datasetId there ambiguous.
Zarr routes are also duplicated to use dataset ids. In a follow-up, once WKW datasets are also supported, the old routes should be removed and the new ones can be renamed to have nicer-sounding routes.

Steps to test:
TODOs:
Issues:
(Please delete unneeded items, merge only when none are left open)
- Updated changelog: No user facing changes for now
- Adapted wk-libs python client if relevant API parts change: API only appended right now