Conversation

@TheByronHimes TheByronHimes commented Nov 28, 2025

Adds GET /boxes/{box_id} endpoint to retrieve an existing FileUploadBox. The new endpoint requires a WPS-signed "view"-type WOT carrying a box_id that matches the path parameter.

The POST /boxes/{box_id}/uploads endpoint, which lets the user initiate a new FileUpload, will now prevent a user from uploading data for a given FileUploadBox beyond a configured limit (file_box_size_limit, default 150 TiB). The core method called from that endpoint now calculates the total announced size of all files associated with the box, both in-progress and completed, and rejects the currently proposed file if the additional size would exceed the per-box limit. This ensures that the limit is respected even if uploads are parallelized at the file level (the current Connector implementation uploads file parts in parallel, but only one file at a time).

Note that the size cap pertains to the unencrypted size. The reasoning is that the user does not control encryption, and while our current encryption makes file size growth calculable, the user will be more comfortable thinking in terms of actual, unencrypted file sizes. On our side, we can decide how much encrypted data to allow per file box, then translate that to the corresponding unencrypted size when configuring the limit.

| Unencrypted Sizes | Box Limit | Encrypted Sizes | All Accepted? |
| --- | --- | --- | --- |
| 9 TB + 1 TB = 10 TB | 10 TB | 9.45 TB + 1.05 TB = 10.5 TB | Yes |
| 9 TB + 2 TB = 11 TB | 10 TB | 9.45 TB + 2.1 TB = 11.55 TB | No, 2nd file rejected |
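The acceptance logic described above can be sketched as follows. This is an illustrative standalone function, not the actual service code; the name `check_box_limit` and the decimal-TB arithmetic are assumptions for the example:

```python
TB = 10**12  # decimal terabytes, matching the table above


def check_box_limit(existing_sizes: list[int], new_file_size: int, box_limit: int) -> bool:
    """Accept a proposed file only if the total announced (unencrypted) size
    of all files in the box, including the new one, stays within the limit."""
    total = sum(existing_sizes) + new_file_size
    return total <= box_limit


# First table row: 9 TB already announced, 1 TB proposed, 10 TB limit -> accepted
assert check_box_limit([9 * TB], 1 * TB, 10 * TB)

# Second table row: 9 TB already announced, 2 TB proposed -> 11 TB total, rejected
assert not check_box_limit([9 * TB], 2 * TB, 10 * TB)
```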

@TheByronHimes TheByronHimes requested a review from Cito November 28, 2025 10:35

@Cito Cito left a comment


See two questions/suggestions below.

```python
mapping = {"box_id": box_id}
return sum(
    [x.size async for x in self._file_upload_dao.find_all(mapping=mapping)]
)
```

@Cito Cito Nov 28, 2025


Maybe create an async sum function that can sum over an async generator, then you don't need to create a list before the summation.

We now do this summation twice - when an upload finishes and before it starts. Can we optimize this by updating the total size on the fly? Maybe keeping two sums, one for completed and one for incomplete uploads?

Btw, we also need to adapt the size when we remove uploads - are we doing this?

For more speedup, we could add an index for the box_id. Or aggregate in the database. Maybe we can add some basic aggregation functionality to hexkit, so that we don't need to load all datasets in order to compute the sum?
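The first suggestion, summing over an async generator without materializing a list, could look something like this. A minimal sketch; `async_sum` is a hypothetical helper name, not an existing hexkit utility:

```python
import asyncio
from typing import AsyncIterable


async def async_sum(values: AsyncIterable[int]) -> int:
    """Sum values from an async iterable without building an intermediate list."""
    total = 0
    async for value in values:
        total += value
    return total


async def sizes():
    """Stand-in for the DAO's find_all result, yielding file sizes."""
    for size in (100, 200, 300):
        yield size


assert asyncio.run(async_sum(sizes())) == 600
```

In the snippet under review, the summation line would then become `return await async_sum(x.size async for x in self._file_upload_dao.find_all(mapping=mapping))`.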

Member Author


Yes, we already adapt the size when uploads are removed. An index would be good, we just need to communicate that to @ckaipf.
In 1b618db I changed the calculation so that files count toward the box size as soon as they are announced, rather than only once completed. The other change is that the calculation is only run when a file is added or deleted.

Contributor


We are talking about a MongoDB index for the collection, right? Didn't we do that somewhere before? Wasn't it configured in a client?

Member Author


Yes, that is also an option.


@Cito Cito Nov 28, 2025


@ckaipf The mass service adds indexes on startup.

Actually, hexkit has a parameter fields_to_index when setting up DAOs, it's just not implemented yet in the Mongo provider. @TheByronHimes Should we use this as an opportunity to finally implement it?

Adding some simple aggregation methods to the DAOs would also be great, but it is a bit more effort if we want to make it somewhat flexible (other aggregates than sums, maybe grouping). We should put that in the backlog.
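For the per-box sum specifically, a MongoDB aggregation pipeline could compute the total server-side instead of loading every record. A sketch of what such a pipeline might look like; the helper name `box_size_pipeline` and the `total_size` key are illustrative, while the `box_id` and `size` field names follow the snippet under review:

```python
def box_size_pipeline(box_id: str) -> list[dict]:
    """Build a MongoDB aggregation pipeline summing file sizes for one box."""
    return [
        # Restrict to documents belonging to the given box
        {"$match": {"box_id": box_id}},
        # Collapse the matched documents into a single summed total
        {"$group": {"_id": None, "total_size": {"$sum": "$size"}}},
    ]
```

With pymongo or motor, the pipeline would be passed to `collection.aggregate(...)`, yielding a single document carrying the summed size.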

Member Author


I can take a look at how to generalize index implementation - the current protocol would certainly have to change, since a list of field names is only enough info to apply a standard-sort index on one field at a time. Or we can start with that and revamp it later to add more full-fledged support for other index types and sort directions.

Member


We can allow both: if you pass a list of field names, it builds standard (non-unique, ascending) indexes; if you pass a mapping of field name to index options, it can build more sophisticated indexes. But we can start with the simple case, which would also be sufficient here.
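One possible shape for that dual interface, sketched as a standalone helper. The name `normalize_index_specs` is hypothetical, and the option keys are assumed to mirror pymongo's `create_index` keyword arguments (e.g. `unique`):

```python
ASCENDING = 1  # same value as pymongo.ASCENDING


def normalize_index_specs(fields_to_index) -> list[tuple[list[tuple[str, int]], dict]]:
    """Turn either a list of field names or a mapping of field name to index
    options into (key_spec, options) pairs suitable for create_index calls."""
    if isinstance(fields_to_index, dict):
        specs = []
        for name, options in fields_to_index.items():
            options = dict(options)  # don't mutate the caller's mapping
            direction = options.pop("direction", ASCENDING)
            specs.append(([(name, direction)], options))
        return specs
    # Simple case: plain field names become standard ascending indexes
    return [([(name, ASCENDING)], {}) for name in fields_to_index]
```

The Mongo provider could then iterate over the pairs and call `collection.create_index(key_spec, **options)` for each.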

Contributor


We don’t currently have a good declarative way to configure MongoDB. So if the index is something the application logic depends on, I think it would be best to ensure it on the client side. If that turns out to be too much work, I can also put something together that does the job.


```python
service_name: str = SERVICE_NAME

file_box_size_limit: int = Field(
```

@Cito Cito Nov 28, 2025


Wouldn't it make sense to take this limit as a parameter when creating a FileUploadBox and store it there? The limit would then be specified at a higher level by the UOS, maybe even individually per inbox.

Member Author


I'm not sure. If the limit were applied per box, then the total data size would always need to be known up front, or else the limit would have to be set arbitrarily high. If it's arbitrarily high, then the individually applied limit doesn't provide an advantage over a universal limit. On the other hand, if submitters can always and definitively say their data will be of size N or less, then we can establish a tailored limit. We would at some point need to update the PATCH endpoint to allow the Data Steward to modify the box limit.
