UCS: Cap total box size (GSI-1995) #148
base: main
Conversation
Cito left a comment:
See two questions/suggestions below.
```python
mapping = {"box_id": box_id}
return sum(
    [x.size async for x in self._file_upload_dao.find_all(mapping=mapping)]
)
```
Maybe create an async sum function that can sum over an async generator; then you don't need to create a list before the summation.
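Such a helper could look roughly like this (a hypothetical `async_sum`, not an existing hexkit utility), consuming the async iterable item by item instead of building a list first:

```python
from collections.abc import AsyncIterable


async def async_sum(items: AsyncIterable[int]) -> int:
    """Sum an async iterable without materializing a list first."""
    total = 0
    async for item in items:
        total += item
    return total
```

The call site would then become `await async_sum(x.size async for x in self._file_upload_dao.find_all(mapping=mapping))`.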
We now do this summation twice - when an upload finishes and before it starts. Can we optimize this by updating the total size on the fly? Maybe keeping two sums, one for completed and one for incomplete uploads?
Btw, we also need to adapt the size when we remove uploads - are we doing this?
For more speedup, we could add an index for the box_id. Or aggregate in the database. Maybe we can add some basic aggregation functionality to hexkit, so that we don't need to load all datasets in order to compute the sum?
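If the summation were pushed into the database, it would amount to a `$match`/`$group` aggregation. The pipeline below is only a sketch of that idea (`"box-123"` is a placeholder box ID, and hexkit currently exposes no such aggregation API):

```python
# Hypothetical MongoDB aggregation pipeline that would compute the total
# size server-side instead of loading every document into the service.
box_id = "box-123"  # placeholder for a real box ID
pipeline = [
    {"$match": {"box_id": box_id}},
    {"$group": {"_id": None, "total_size": {"$sum": "$size"}}},
]
```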
Yes, we already adapt the size when uploads are removed. An index would be good, we just need to communicate that to @ckaipf.
In 1b618db I changed the calculation so that files count toward the box size as soon as they are announced, rather than only once completed. The other change is that the calculation is only run when a file is added or deleted.
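For illustration, the on-the-fly bookkeeping suggested earlier in the thread could be sketched as a small tracker that updates the total on add and remove (names are hypothetical, not the actual UCS code):

```python
class BoxSizeTracker:
    """Illustrative running total of announced upload sizes for one box."""

    def __init__(self) -> None:
        self._sizes: dict[str, int] = {}  # upload_id -> announced size
        self.total_size = 0

    def add_upload(self, upload_id: str, size: int) -> None:
        """Count an upload toward the box total as soon as it is announced."""
        self._sizes[upload_id] = size
        self.total_size += size

    def remove_upload(self, upload_id: str) -> None:
        """Subtract a removed upload's size from the box total."""
        self.total_size -= self._sizes.pop(upload_id)
```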
We are talking about a MongoDB index for the collection, right? Didn't we do that somewhere before? Wasn't it configured in a client?
Yes, that is also an option.
@ckaipf The mass service adds indexes on startup.
Actually, hexkit has a parameter fields_to_index when setting up DAOs, but it's not implemented yet in the Mongo provider. @TheByronHimes Should we use this as an opportunity to finally implement it?
Adding some simple aggregation methods to the DAOs would also be great, but is a bit more effort if we want to make it somewhat flexible (other aggregates than sums, maybe grouping). We should put that in the backlog.
I can take a look at how to generalize index implementation - the current protocol would certainly have to change, since a list of field names is only enough info to apply a standard-sort index on one field at a time. Or we can start with that and revamp it later to add more full-fledged support for other index types and sort directions.
We can allow both: if you pass a list of field names, it builds standard (non-unique, ascending) indexes; if you pass a mapping of field name to index options, it can build more sophisticated indexes. But we can start with the simple case, which would also be sufficient here.
We don’t currently have a good declarative way to configure MongoDB. So if the index is something the application logic depends on, I think it would be best to ensure it on the client side. If that turns out to be too much work, I can also put something together that does the job.
```python
service_name: str = SERVICE_NAME
# …
file_box_size_limit: int = Field(
```
Wouldn't it make sense to take this limit as a parameter when creating a FileUploadBox and store it there? The limit would then be specified on a higher level by the UOS, maybe even individually per inbox.
I'm not sure. If the limit were applied per box, then the total data size would always need to be known up front, or else the limit would have to be arbitrarily high. If it's arbitrarily high, then the individually applied limit doesn't provide an advantage over a universal limit. On the other hand, if submitters can always and definitively say their data will be of size N or less, then we can establish a tailored limit. We would at some point need to update the PATCH endpoint to allow the Data Steward to modify the box limit.
Adds a `GET /boxes/{box_id}` endpoint to retrieve an existing `FileUploadBox`. The new endpoint requires a WPS-signed `"view"`-type WOT carrying a `box_id` that matches the path parameter.
The `POST /boxes/{box_id}/uploads` endpoint, which lets the user initiate a new `FileUpload`, will now prevent a user from uploading data for a given `FileUploadBox` beyond a configured limit (`file_box_size_limit`, default 150 TiB). The core method called from that endpoint now calculates the total announced size of all files associated with the box, both in-progress and completed, and rejects the currently proposed file if the additional size would exceed the per-box limit. This ensures that the limit is respected even if uploads are parallelized at the file level (the current Connector implementation parallelizes file parts but uploads only one file at a time).
Important to note: the size cap pertains to the unencrypted size. The reasoning is that the user does not control encryption, and while our current encryption makes file size growth calculable, the user will be more comfortable thinking in terms of actual/unencrypted file sizes. On our side we can decide how much encrypted data to allow per file box, then translate that to the corresponding unencrypted size when configuring the limit.
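The check described above can be sketched as follows (the names are illustrative, not the actual UCS implementation):

```python
class BoxSizeLimitError(RuntimeError):
    """Raised when a proposed upload would exceed the per-box size cap."""


def check_upload_allowed(
    *, announced_total: int, new_file_size: int, file_box_size_limit: int
) -> None:
    """Reject the proposed file if it would push the box past the cap.

    `announced_total` is the summed (unencrypted) size of all files
    already announced for the box, in-progress and completed alike.
    """
    if announced_total + new_file_size > file_box_size_limit:
        raise BoxSizeLimitError(
            f"Adding {new_file_size} bytes would exceed the box limit "
            f"of {file_box_size_limit} bytes"
        )
```

Note that a file landing exactly on the limit is still accepted; only sizes strictly beyond the cap are rejected.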