Conversation

@TheByronHimes TheByronHimes commented Nov 28, 2025

Adds GET /boxes/{box_id} endpoint to retrieve an existing FileUploadBox. The new endpoint requires a WPS-signed "view"-type WOT carrying a box_id that matches the path parameter.

The POST /boxes/{box_id}/uploads endpoint, which lets the user initiate a new FileUpload, will now prevent a user from uploading data for a given FileUploadBox beyond a configured limit (file_box_size_limit, default 150 TiB). The core method called from that endpoint now calculates the total announced size of all files associated with the box, both in-progress and completed, and rejects the currently proposed file if the additional size would exceed the per-box limit. This ensures that the limit is respected even if uploads are parallelized at the file level (the current Connector implementation uploads file parts in parallel, but only one file at a time).

Note that the size cap pertains to the unencrypted size. The reasoning is that the user does not control encryption, and while our current encryption makes file size growth calculable, the user will be more comfortable thinking in terms of actual, unencrypted file sizes. On our side, we can decide how much encrypted data to allow per file box, then translate that to the corresponding unencrypted size when configuring the limit.

| Unencrypted Sizes | Box Limit | Encrypted Sizes | All Accepted? |
| --- | --- | --- | --- |
| 9 TB + 1 TB = 10 TB | 10 TB | 9.45 TB + 1.05 TB = 10.5 TB | Yes |
| 9 TB + 2 TB = 11 TB | 10 TB | 9.45 TB + 2.1 TB = 11.55 TB | No, 2nd file rejected |
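The acceptance logic described above can be sketched as follows. This is an illustrative standalone function, not the actual service code; the name `check_box_limit` and the decimal-TB arithmetic are assumptions for the example:

```python
TB = 10**12  # decimal terabytes, matching the table above


def check_box_limit(existing_sizes: list[int], new_file_size: int, box_limit: int) -> bool:
    """Accept a proposed file only if the total announced (unencrypted) size
    of all files in the box, including the new one, stays within the limit."""
    total = sum(existing_sizes) + new_file_size
    return total <= box_limit


# First table row: 9 TB already announced, 1 TB proposed, 10 TB limit -> accepted
assert check_box_limit([9 * TB], 1 * TB, 10 * TB)

# Second table row: 9 TB already announced, 2 TB proposed -> 11 TB total, rejected
assert not check_box_limit([9 * TB], 2 * TB, 10 * TB)
```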

@TheByronHimes TheByronHimes requested a review from Cito November 28, 2025 10:35

@Cito Cito left a comment


See two questions/suggestions below.

```python
mapping = {"box_id": box_id}
return sum(
    [x.size async for x in self._file_upload_dao.find_all(mapping=mapping)]
)
```

@Cito Cito Nov 28, 2025


Maybe create an async sum function that can sum over an async generator, then you don't need to create a list before the summation.

We now do this summation twice - when an upload finishes and before it starts. Can we optimize this by updating the total size on the fly? Maybe keeping two sums, one for completed and one for incomplete uploads?

Btw, we also need to adapt the size when we remove uploads - are we doing this?

For more speedup, we could add an index for the box_id. Or aggregate in the database. Maybe we can add some basic aggregation functionality to hexkit, so that we don't need to load all datasets in order to compute the sum?
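The first suggestion, summing over an async generator without materializing a list, could look something like this. A minimal sketch; `async_sum` is a hypothetical helper name, not an existing hexkit utility:

```python
import asyncio
from typing import AsyncIterable


async def async_sum(values: AsyncIterable[int]) -> int:
    """Sum values from an async iterable without building an intermediate list."""
    total = 0
    async for value in values:
        total += value
    return total


async def sizes():
    """Stand-in for the DAO's find_all result, yielding file sizes."""
    for size in (100, 200, 300):
        yield size


assert asyncio.run(async_sum(sizes())) == 600
```

In the snippet under review, the summation line would then become `return await async_sum(x.size async for x in self._file_upload_dao.find_all(mapping=mapping))`.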

Member Author


Yes, we already adapt the size when uploads are removed. An index would be good, we just need to communicate that to @ckaipf.
In 1b618db I changed the calculation so that files count toward the box size as soon as they are announced, rather than only once completed. The other change is that the calculation is only run when a file is added or deleted.

Contributor


We are talking about a MongoDB index for the collection, right? Didn't we do that somewhere before? Wasn't it configured in a client?

Member Author


Yes, that is also an option.


@Cito Cito Nov 28, 2025


@ckaipf The mass service adds indexes on startup.

Actually, hexkit has a parameter fields_to_index when setting up DAOs, it's just not implemented yet in the Mongo provider. @TheByronHimes Should we use this as an opportunity to finally implement it?

Adding some simple aggregation methods to the DAOs would also be great, but it is a bit more effort if we want to make it somewhat flexible (other aggregates than sums, maybe grouping). We should put that in the backlog.
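For the per-box sum specifically, a MongoDB aggregation pipeline could compute the total server-side instead of loading every record. A sketch of what such a pipeline might look like; the helper name `box_size_pipeline` and the `total_size` key are illustrative, while the `box_id` and `size` field names follow the snippet under review:

```python
def box_size_pipeline(box_id: str) -> list[dict]:
    """Build a MongoDB aggregation pipeline summing file sizes for one box."""
    return [
        # Restrict to documents belonging to the given box
        {"$match": {"box_id": box_id}},
        # Collapse the matched documents into a single summed total
        {"$group": {"_id": None, "total_size": {"$sum": "$size"}}},
    ]
```

With pymongo or motor, the pipeline would be passed to `collection.aggregate(...)`, yielding a single document carrying the summed size.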

Member Author


I can take a look at how to generalize index implementation - the current protocol would certainly have to change, since a list of field names is only enough info to apply a standard-sort index on one field at a time. Or we can start with that and revamp it later to add more full-fledged support for other index types and sort directions.

Member


We can allow both: if you pass a list of field names, it builds standard (non-unique, ascending) indexes; if you pass a mapping of field name to index options, it can build more sophisticated indexes. But we can start with the simple case, which would also be sufficient here.
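One possible shape for that dual interface, sketched as a standalone helper. The name `normalize_index_specs` is hypothetical, and the option keys are assumed to mirror pymongo's `create_index` keyword arguments (e.g. `unique`):

```python
ASCENDING = 1  # same value as pymongo.ASCENDING


def normalize_index_specs(fields_to_index) -> list[tuple[list[tuple[str, int]], dict]]:
    """Turn either a list of field names or a mapping of field name to index
    options into (key_spec, options) pairs suitable for create_index calls."""
    if isinstance(fields_to_index, dict):
        specs = []
        for name, options in fields_to_index.items():
            options = dict(options)  # don't mutate the caller's mapping
            direction = options.pop("direction", ASCENDING)
            specs.append(([(name, direction)], options))
        return specs
    # Simple case: plain field names become standard ascending indexes
    return [([(name, ASCENDING)], {}) for name in fields_to_index]
```

The Mongo provider could then iterate over the pairs and call `collection.create_index(key_spec, **options)` for each.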

Contributor


We don’t currently have a good declarative way to configure MongoDB. So if the index is something the application logic depends on, I think it would be best to ensure it on the client side. If that turns out to be too much work, I can also put something together that does the job.


```python
service_name: str = SERVICE_NAME

file_box_size_limit: int = Field(
```

@Cito Cito Nov 28, 2025


Wouldn't it make sense to take this limit as a parameter when creating a FileUploadBox and store it there? The limit would then be specified at a higher level by the UOS, maybe even individually per inbox.

Member Author


I'm not sure. If the limit were applied per box, then the total data size would always need to be known up front, or else the limit would have to be set arbitrarily high. If it's arbitrarily high, then the individually applied limit doesn't provide an advantage over a universal limit. On the other hand, if submitters can always and definitively say their data will be of size N or less, then we can establish a tailored limit. We would at some point need to update the PATCH endpoint to allow the Data Steward to modify the box limit.
