Access Ozone using PyArrow (Docker Quickstart)

This tutorial demonstrates how to access Apache Ozone from Python using PyArrow, with Ozone running in Docker.

Prerequisites

  • Docker and Docker Compose installed.
  • Python 3.x environment.

Steps

1️⃣ Start Ozone in Docker

Download the latest Docker Compose file for Ozone and start the cluster with 3 DataNodes:

curl -O https://raw.githubusercontent.com/apache/ozone-docker/refs/heads/latest/docker-compose.yaml
docker compose up -d --scale datanode=3

2️⃣ Connect to the SCM Container

docker exec -it <your-scm-container-name-or-id> bash

Replace <your-scm-container-name-or-id> with the actual name or ID of the SCM container; you can find it with docker ps.

The remaining steps in this tutorial run inside this container.

Create a volume and a bucket inside Ozone:

ozone sh volume create volume
ozone sh bucket create volume/bucket

3️⃣ Install PyArrow in Your Python Environment

pip install pyarrow
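
To verify the installation, you can print the installed version with a quick one-liner (any recent PyArrow release with libhdfs support should work):

python -c "import pyarrow; print(pyarrow.__version__)"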

4️⃣ Download Hadoop Native Libraries for libhdfs Support

Depending on your system architecture, run one of the following:

For ARM64 (Apple Silicon, ARM servers):

curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0-aarch64.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'

For x86_64 (most desktops and servers):

curl -L "https://www.apache.org/dyn/closer.lua?action=download&filename=hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz" | tar -xz --wildcards 'hadoop-3.4.0/lib/native/libhdfs.*'

Set environment variables so PyArrow can find the native libhdfs library and the Ozone client classpath (the path below is relative to the directory where you extracted the tarball; adjust it or use an absolute path if you run your script from elsewhere):

export ARROW_LIBHDFS_DIR=hadoop-3.4.0/lib/native/
export CLASSPATH=$(ozone classpath ozone-tools)

5️⃣ Configure core-site.xml

Add the following to /etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>ofs://om:9862</value>
        <description>Ozone Manager endpoint</description>
    </property>
</configuration>

Note: the Docker container sets the environment variable OZONE_CONF_DIR=/etc/hadoop/, so the Ozone tooling knows where to locate its configuration files.
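
As an alternative to relying on fs.defaultFS, PyArrow's HadoopFileSystem constructor also accepts an explicit host and port. A minimal sketch, assuming the Ozone filesystem classes on the CLASSPATH resolve the ofs:// scheme:

#!/usr/bin/python
import pyarrow.fs as pafs

# Connect to the Ozone Manager directly instead of reading fs.defaultFS
# (assumes the ofs:// scheme is resolvable via the Ozone jars on the CLASSPATH)
fs = pafs.HadoopFileSystem(host="ofs://om", port=9862)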

6️⃣ Access Ozone Using PyArrow

Create a Python script (ozone_pyarrow_example.py) with the following code:

#!/usr/bin/python
import pyarrow.fs as pafs

# Connect to Ozone using HadoopFileSystem
# "default" tells PyArrow to use the fs.defaultFS property from core-site.xml
fs = pafs.HadoopFileSystem("default")

# Create a directory inside the bucket
fs.create_dir("volume/bucket/aaa")

# Write data to a file
path = "volume/bucket/file1"
with fs.open_output_stream(path) as stream:
    stream.write(b'data')

Run the script:

python ozone_pyarrow_example.py
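
To confirm the write, you can read the object back and list the bucket with PyArrow. A minimal sketch using the same connection settings:

#!/usr/bin/python
import pyarrow.fs as pafs

fs = pafs.HadoopFileSystem("default")

# Read back the file written by the example script
with fs.open_input_stream("volume/bucket/file1") as stream:
    print(stream.read())  # expected: b'data'

# List everything under the bucket
selector = pafs.FileSelector("volume/bucket", recursive=True)
for info in fs.get_file_info(selector):
    print(info.path, info.type)

You can also confirm the key exists from the Ozone CLI with ozone sh key list /volume/bucket.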

✅ Congratulations! You’ve successfully accessed Ozone from Python using PyArrow and Docker.

Troubleshooting Tips

  • libhdfs Errors: Ensure ARROW_LIBHDFS_DIR is set and points to the correct native library path (see the sanity check after this list).
  • Connection Issues: Verify the Ozone Manager endpoint (om:9862) is correct and reachable.
  • Permissions: Ensure your Ozone user has the correct permissions for the volume and bucket.
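
For the first two issues, a quick check from Python can narrow down whether the problem is the native library or the connection; a minimal sketch:

#!/usr/bin/python
import os
import pyarrow.fs as pafs

# Native library check: PyArrow loads libhdfs from this directory
print("ARROW_LIBHDFS_DIR =", os.environ.get("ARROW_LIBHDFS_DIR"))
print("CLASSPATH set:", bool(os.environ.get("CLASSPATH")))

# Connection check: stat the bucket; this fails fast if the Ozone Manager is unreachable
fs = pafs.HadoopFileSystem("default")
print(fs.get_file_info("volume/bucket"))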
