Accessing Apache Ozone from Python
Apache Ozone project itself does not provide Python client libraries. However, several third-party open source libraries can be used to build applications to access an Ozone cluster via different interfaces: OFS file system, Ozone HTTPFS REST API and Ozone S3.
This document outlines these approaches, providing concise setup instructions and validated code examples.
Setup and Prerequisites
Before starting, ensure the following:
- Python installed (3.x recommended)
- Apache Ozone configured and accessible
- For PyArrow with libhdfs:
- PyArrow library (
pip install pyarrow
) - Hadoop native libraries configured and Ozone classpath specified (see below for details)
- PyArrow library (
- For S3 access:
- Boto3 (
pip install boto3
) - Ozone S3 Gateway endpoint and bucket names
- Access credentials (AWS-like key and secret)
- Boto3 (
- For HttpFS access:
- Requests (
pip install requests
) or fsspec (pip install fsspec
)
- Requests (
Method 1: Access Ozone via PyArrow and libhdfs
This approach leverages PyArrow’s HadoopFileSystem API, which requires libhdfs.so native library. The libhdfs.so is not packaged within PyArrow and you must download it separately from Hadoop.
Configuration
Ensure Ozone configuration files (core-site.xml and ozone-site.xml) are available and OZONE_CONF_DIR
is set.
Also ensure ARROW_LIBHDFS_DIR
and CLASSPATH
are set properly.
For example,
export ARROW_LIBHDFS_DIR=hadoop-3.4.0/lib/native/
export CLASSPATH=$(ozone classpath ozone-tools)
Code Example
import pyarrow.fs as pafs
# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default")
# Create a directory inside the bucket
fs.create_dir("volume/bucket/aaa")
# Write data to a file
path = "volume/bucket/file1"
with fs.open_output_stream(path) as stream:
stream.write(b'data')
Note: configure fs.defaultFS in the core-site.xml to point to the Ozone cluster. For example,
<configuration>
<property>
<name>fs.defaultFS</name>
<value>ofs://om:9862</value>
<description>Ozone Manager endpoint</description>
</property>
</configuration>
Try it yourself! Check out PyArrow Tutorial for a quick start using Ozone’s Docker image.
Method 2: Access Ozone via Boto3 and S3 Gateway
Configuration
- Identify your Ozone S3 Gateway endpoint (e.g.,
http://s3g:9878
). - Use AWS-compatible credentials (from Ozone).
Code Example
import boto3
# Create a local file to upload
with open("localfile.txt", "w") as f:
f.write("Hello from Ozone via Boto3!\n")
# Configure Boto3 client
s3 = boto3.client(
's3',
endpoint_url='http://s3g:9878',
aws_access_key_id='ozone-access-key',
aws_secret_access_key='ozone-secret-key'
)
# List buckets
response = s3.list_buckets()
print("Buckets:", response['Buckets'])
# Upload the file
s3.upload_file('localfile.txt', 'bucket', 'file.txt')
print("Uploaded 'localfile.txt' to 'bucket/file.txt'")
# Download the file back
s3.download_file('bucket', 'file.txt', 'downloaded.txt')
print("Downloaded 'file.txt' as 'downloaded.txt'")
Note: Replace endpoint URL, credentials, and bucket names with your setup.
Try it yourself! Check out Boto3 Tutorial for a quick start using Ozone’s Docker image.
Method 3: Access Ozone via HttpFS REST API
First, install requests Python module:
pip install requests
Configuration
- Use Ozone’s HTTPFS endpoint (e.g.,
http://httpfs:14000
).
Code Example (requests)
#!/usr/bin/python
import requests
# Ozone HTTPFS endpoint and file path
host = "http://httpfs:14000"
volume = "vol1"
bucket = "bucket1"
filename = "hello.txt"
path = f"/webhdfs/v1/{volume}/{bucket}/{filename}"
user = "ozone" # can be any value in simple auth mode
# Step 1: Initiate file creation (responds with 307 redirect)
params_create = {
"op": "CREATE",
"overwrite": "true",
"user.name": user
}
print("Creating file...")
resp_create = requests.put(host + path, params=params_create, allow_redirects=False)
if resp_create.status_code != 307:
print(f"Unexpected response: {resp_create.status_code}")
print(resp_create.text)
exit(1)
redirect_url = resp_create.headers['Location']
print(f"Redirected to: {redirect_url}")
# Step 2: Write data to the redirected location with correct headers
headers = {"Content-Type": "application/octet-stream"}
content = b"Hello from Ozone HTTPFS!\n"
resp_upload = requests.put(redirect_url, data=content, headers=headers)
if resp_upload.status_code != 201:
print(f"Upload failed: {resp_upload.status_code}")
print(resp_upload.text)
exit(1)
print("File created successfully.")
# Step 3: Read the file back
params_open = {
"op": "OPEN",
"user.name": user
}
print("Reading file...")
resp_read = requests.get(host + path, params=params_open, allow_redirects=True)
if resp_read.ok:
print("File contents:")
print(resp_read.text)
else:
print(f"Read failed: {resp_read.status_code}")
print(resp_read.text)
Try it yourself! Check out Access Ozone using HTTPFS REST API Tutorial for a quick start using Ozone’s Docker image.
Code Example (webhdfs)
First, install fsspec Python module:
pip install fsspec
from fsspec.implementations.webhdfs import WebHDFS
fs = WebHDFS(host='httpfs', port=14000, user='ozone')
# Read a file from /vol1/bucket1/hello.txt
file_path = "/vol1/bucket1/hello.txt"
with fs.open(file_path, mode='rb') as f:
content = f.read()
print("File contents:")
print(content.decode('utf-8'))
Note: Replace host, port, and path as per your setup.
Troubleshooting Tips
- Authentication Errors: Verify credentials and Kerberos tokens (if used).
- Connection Issues: Check endpoint URLs, ports, and firewall rules.
- FileSystem Errors: Ensure correct Ozone configuration and appropriate permissions.
- Missing Dependencies: Install required Python packages (
pip install pyarrow boto3 requests fsspec
).