[HDDS-3331] Ozone Volume Management (accepted)
 
Introduction
This document explores how we can improve the Ozone volume semantics, especially with respect to the S3 compatibility layer.
The Problems
1. Unprivileged users cannot enumerate volumes.
2. The mapping of S3 buckets to Ozone volumes is confusing. Based on external feedback it’s hard to understand the exact Ozone URL to be used.
3. The volume name is not friendly and cannot be remembered by humans.
4. Ozone buckets created via the native object store interface are not visible via the S3 gateway.
5. We don’t support the revocation of access keys.
We explore some of these in more detail in subsequent sections.
Volume enumeration problem
Currently, when a user enumerates volumes, they see only the list of volumes that they own. This means that when an unprivileged user enumerates volumes, they always get an empty list. Instead, users should be able to see all volumes to which they have been granted read or write access.
This also has an impact on ofs, which makes volumes appear as top-level directories.
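For example (a hedged sketch: the user, keytab path, and volume name are hypothetical, and the output shape is illustrative), an unprivileged user who was granted READ access to /vol1 still gets an empty listing today:
> kinit -kt /etc/security/keytabs/alice.keytab alice
> ozone sh volume list
[ ]
With the proposed behavior, /vol1 would be included in the listing because alice holds an ACL on it.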
S3 to HCFS path mapping problem
Ozone has volume and bucket semantics, while S3 has only buckets. To make it possible to use the same bucket both from the Hadoop world and via S3, we need a mapping between them.
Currently we maintain a map between the S3 buckets and the Ozone volumes + buckets in `OmMetadataManagerImpl`:
s3_bucket --> ozone_volume/ozone_bucket
The current implementation uses the `"s3" + s3UserName` string as the volume name and the `s3BucketName` as the bucket name, where `s3UserName` is `DigestUtils.md5Hex(kerberosUsername.toLowerCase())`.
To create an S3 bucket and use it from o3fs, you should:
- Get your personal secret based on your Kerberos keytab
> kinit -kt /etc/security/keytabs/testuser.keytab testuser/scm
> ozone s3 getsecret
awsAccessKey=testuser/scm@EXAMPLE.COM
awsSecret=7a6d81dbae019085585513757b1e5332289bdbffa849126bcb7c20f2d9852092
- Create the bucket with S3 cli
> export AWS_ACCESS_KEY_ID=testuser/scm@EXAMPLE.COM
> export AWS_SECRET_ACCESS_KEY=7a6d81dbae019085585513757b1e5332289bdbffa849126bcb7c20f2d9852092
> aws s3api --endpoint http://localhost:9878 create-bucket --bucket=bucket1
- Identify the Ozone path
> ozone s3 path bucket1
Volume name for S3Bucket is : s3c89e813c80ffcea9543004d57b2a1239
Ozone FileSystem Uri is : o3fs://bucket1.s3c89e813c80ffcea9543004d57b2a1239
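The derived volume name can be reproduced from the formula above. This is a sketch; it assumes the lowercased principal is exactly the string that gets hashed, which depends on how OM resolves `kerberosUsername`:
> echo -n 'testuser/scm@example.com' | md5sum | awk '{print "s3" $1}'
The result is a name of the form s3<32 hex chars>, such as the s3c89e813c80ffcea9543004d57b2a1239 printed above. The bucket is then usable from Hadoop tools via that path:
> ozone fs -ls o3fs://bucket1.s3c89e813c80ffcea9543004d57b2a1239/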
Proposed solution
Supporting multiple access keys (#5 from the problem listing)
Problem #5 can easily be supported by improving the `ozone s3` CLI. Ozone has a separate table for the S3 secrets, and the API can be improved to handle multiple secrets for one specific Kerberos user.
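For example, revocation (problem #5) could be exposed as a subcommand next to getsecret; the exact command name below is an assumption, sketching the intended shape:
> ozone s3 revokesecret
> ozone s3 getsecret
The second call would then issue a fresh secret for the same Kerberos user.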
Solving the mapping problem (#2-4 from the problem listing)
- Let’s always use the `s3v` volume for all S3 buckets created from the S3 interface. This is an easy and fast method, but with this approach not all volumes are available via the S3 interface, so we need to provide a method to publish any Ozone volume/bucket.
- Let’s improve the existing toolset to expose any Ozone volume/bucket as an S3 bucket (e.g. expose `o3://vol1/bucketx` as an S3 bucket `s3://foobar`).
Implementation:
The first part is easy compared to the current implementation. We don’t need any mapping table any more.
To implement the second (exposing Ozone buckets as S3 buckets) we have multiple options:
- Store some metadata (the S3 bucket name) on each of the buckets
- Implement a symbolic link mechanism which makes it possible to link to any volume/bucket from the “s3” volume.
The first approach would require a secondary cache table, and it violates the naming hierarchy. The S3 bucket name is a globally unique name, therefore it’s more than just a single attribute on a specific object; it’s more like an element in the hierarchy. For this reason the second option is proposed.
For example, if the default S3 volume is `s3v`:
- Every new bucket created via the S3 interface will be placed under the `/s3v` volume.
- Any existing Ozone bucket can be exposed by linking to it from `/s3v`:
> ozone sh bucket link /vol1/bucket1 /s3v/s3bucketname
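An end-to-end sketch of the proposed flow (bucket names and the gateway endpoint are examples):
> ozone sh bucket create /vol1/bucket1
> ozone sh bucket link /vol1/bucket1 /s3v/s3bucketname
> aws s3api --endpoint http://localhost:9878 list-buckets
> aws s3api --endpoint http://localhost:9878 put-object --bucket=s3bucketname --key=key1 --body=/tmp/file1
After the link is created, s3bucketname shows up in list-buckets, and keys written through S3 land in /vol1/bucket1.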
Lock contention problem
One possible problem with using just one volume is that all S3 buckets contend on the locks of that same volume (thanks, Xiaoyu). But this shouldn’t be a big problem:
- We hold only a READ lock; most of the time it can be acquired without any contention (a write lock is required only to change the owner or set the quota).
- For symbolic links the read lock is only required for the first read; after that, the lock of the referenced volume is used. In case of any performance problem, multiple volumes and links can be used, as sketched below.
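A hedged sketch of that mitigation (the volume and bucket names are examples): hot buckets can be linked from additional volumes, so their operations take locks on different volumes:
> ozone sh volume create /s3v2
> ozone sh bucket link /vol2/hotbucket /s3v2/hotbucket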
Note: Sanjay is added to the authors as the original proposer of this approach.
Implementation details
The `bucket link` operation creates a link bucket. Links are like regular buckets, stored in the DB the same way, but with two new, optional pieces of information: source volume and source bucket. (The bucket referenced by the link is called the “source”, not the “target”, to follow symlink terminology.)
- Link buckets share the namespace with regular buckets. If a bucket or link with the same name already exists, a `BUCKET_ALREADY_EXISTS` result is returned.
- Link buckets are not inherently specific to a user; access is restricted only by ACLs.
- Link buckets retain their owner ACLs, which are inherited from the default ACLs of their volume. Additionally, link buckets allow anyone READ and WRITE permission, similar to POSIX symbolic links.
- All add/set/remove ACL operations are proxied to the source bucket. The getacl operation on a link bucket shows the link bucket’s own ACL.
- Links are persistent, i.e. they can be used until they are deleted.
- Existing bucket operations (info, delete, ACL) work on the link object in the same way as they do on regular buckets. No new link-specific RPC is required.
- Links are followed for key operations (list, get, put, etc.). Read permission on the source bucket is required for this.
- Checks for the existence of the source bucket, as well as ACL checks, are performed only when following the link (similar to symlinks). The source bucket is not checked when operating on the link bucket itself (e.g. deleting it). This avoids the need for reverse checks on each bucket delete or ACL change. A short session illustrating this follows the list.
- Bucket links are generic, not restricted to the `s3v` volume.
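The deferred source check can be illustrated with a short session (names are examples; the exact CLI output is omitted):
> ozone sh bucket link /vol1/bucket1 /s3v/link1
> ozone sh bucket info /s3v/link1
> ozone sh bucket delete /vol1/bucket1
> ozone sh bucket info /s3v/link1
> ozone sh key list /s3v/link1
The second info call still succeeds, because operating on the link object itself does not check the source; only the final key list, which follows the link, fails due to the missing source bucket.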
Alternative approaches and reasons to reject
To solve the S3 bucket name to Ozone bucket name mapping problem, some other approaches were also considered. They were rejected, but they are kept in this section together with the reasons for rejection.
1. Predefined volume mapping
- Let’s support multiple `ACCESS_KEY_ID`s for the same user.
- For each `ACCESS_KEY_ID` a volume name MUST be defined.
- Instead of using a specific mapping table, the `ACCESS_KEY_ID` would provide a view of the buckets in the specified volume.
With this approach the used volume would be more visible and, hopefully, more understandable.
Instead of `ozone s3 getsecret`, the following commands would be used:
- `ozone s3 secret create --volume=myvolume`: to create a secret and use myvolume for all of its buckets
- `ozone s3 secret list`: to list all of the existing S3 secrets (available for the current user)
- `ozone s3 secret delete <ACCESS_KEY_ID>`: to delete any secret
The `AWS_ACCESS_KEY_ID` should be a random identifier instead of a Kerberos principal. A possible session is sketched below.
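A possible session with these proposed (not yet existing) commands, with placeholder output:
> ozone s3 secret create --volume=myvolume
awsAccessKey=<random id>
awsSecret=<generated secret>
> ozone s3 secret list
> ozone s3 secret delete <ACCESS_KEY_ID>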
- pro: Easier to understand
- con: We should either have globally unique bucket names, or it would be possible to see two different buckets with the same name (when using different access keys).
- con: It can be hard to remember which volumes are assigned to a specific ACCESS_KEY_ID
2. String Magic
We can try to make the volume name visible to the S3 world by using structured bucket names. Unfortunately, the available separator characters are very limited. For example, we can’t use /:
> aws s3api create-bucket --bucket=vol1/bucket1
Parameter validation failed:
Invalid bucket name "vol1/bucket1": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
But it’s possible to use the volume-bucket notation:
> aws s3api create-bucket --bucket=vol1-bucket1
Here vol1-bucket1 could be interpreted as volume vol1 + bucket bucket1.
- pro: Volume mapping is visible all the time.
- con: Harder to use any external tool with defaults (all bucket names would need to contain at least one -).
- con: The hierarchy is not visible. The uniform way to separate elements in a filesystem hierarchy is /, so this can be confusing.
3. Remove volume from OzoneFs paths
We can also make volumes a lightweight bucket-group object by removing them from the OzoneFs path. With this approach we can use all the benefits of volumes as an administration object, but they would be removed from the o3fs path.
- pro: Can be the simplest solution. Easy to understand, as there are no more volumes in the path.
- con: Bigger change (all the APIs would need to be modified to make volumes optional).
- con: Harder to partition namespaces based on volumes. (With the current scheme, it’s easier to delegate the responsibility for one volume to a different OM.)
- con: We lose volumes as the top-level directories in the ofs scheme.
- con: One level of hierarchy might not be enough in case of multi-tenancy.
- con: One level of hierarchy is not enough if we would like to provide separate levels for users and admins.
- con: A hierarchical abstraction can be easier to manage and understand.