可观察性

Ozone 提供了多种工具来获取有关集群当前状态的更多信息。

Prometheus

Ozone 原生支持 Prometheus 集成。所有内部指标(由 Hadoop 指标框架收集)都发布在 /prom 的 HTTP 端点下。(例如,在 SCM 的 http://localhost:9876/prom)。

Prometheus 端点默认是打开的,但可以通过hdds.prometheus.endpoint.enabled配置变量把它关闭。

在安全环境中,该页面是用 SPNEGO 认证来保护的,但 Prometheus 不支持这种认证。为了在安全环境中启用监控,可以配置一个特定的认证令牌。

ozone-site.xml 配置示例:

<property>
   <name>hdds.prometheus.endpoint.token</name>
   <value>putyourtokenhere</value>
</property>

prometheus 配置示例:

scrape_configs:
  - job_name: ozone
    bearer_token: <putyourtokenhere>
    metrics_path: /prom
    static_configs:
     - targets:
         - "127.0.0.1:9876" 

分布式跟踪

分布式跟踪可以通过可视化端到端的性能来帮助了解性能瓶颈。

Ozone 使用 jaeger 跟踪库收集跟踪,可以将跟踪数据发送到任何兼容的后端(Zipkin,…)。

默认情况下,跟踪功能是关闭的,可以通过 ozon-site.xmlhdds.tracing.enabled 配置变量打开。

<property>
   <name>hdds.tracing.enabled</name>
   <value>true</value>
</property>

Jaeger 客户端可以用环境变量进行配置,如这份文档所述。

例如:

JAEGER_SAMPLER_PARAM=0.01
JAEGER_SAMPLER_TYPE=probabilistic
JAEGER_AGENT_HOST=jaeger

此配置将记录1%的请求,以限制性能开销。有关 Jaeger 抽样的更多信息,请查看文档

Ozone Insight

Ozone Insight 是一个用于检查 Ozone 集群当前状态的工具,它可以显示特定组件的日志记录、指标和配置。

请使用ozone insight list命令检查可用的组件:

> ozone insight list

Available insight points:

  scm.node-manager                     SCM Datanode management related information.
  scm.replica-manager                  SCM closed container replication manager
  scm.event-queue                      Information about the internal async event delivery
  scm.protocol.block-location          SCM Block location protocol endpoint
  scm.protocol.container-location      SCM Container location protocol endpoint
  scm.protocol.security                SCM Block location protocol endpoint
  om.key-manager                       OM Key Manager
  om.protocol.client                   Ozone Manager RPC endpoint
  datanode.pipeline                    More information about one ratis datanode ring.

配置

ozone insight config 可以显示与特定组件有关的配置(只支持选定的组件)。

> ozone insight config scm.replica-manager

Configuration for `scm.replica-manager` (SCM closed container replication manager)

>>> hdds.scm.replication.thread.interval
       default: 300s
       current: 300s

There is a replication monitor thread running inside SCM which takes care of replicating the containers in the cluster. This property is used to configure the interval in which that thread runs.


>>> hdds.scm.replication.event.timeout
       default: 30m
       current: 30m

Timeout for the container replication/deletion commands sent  to datanodes. After this timeout the command will be retried.

指标

ozone insight metrics 可以显示与特定组件相关的指标(只支持选定的组件)。

> ozone insight metrics scm.protocol.block-location
Metrics for `scm.protocol.block-location` (SCM Block location protocol endpoint)

RPC connections

  Open connections: 0
  Dropped connections: 0
  Received bytes: 1267
  Sent bytes: 2420


RPC queue

  RPC average queue time: 0.0
  RPC call queue length: 0


RPC performance

  RPC processing time average: 0.0
  Number of slow calls: 0


Message type counters

  Number of AllocateScmBlock: ???
  Number of DeleteScmKeyBlocks: ???
  Number of GetScmInfo: ???
  Number of SortDatanodes: ???

日志

ozone insights logs 可以连接到所需的服务并显示与一个特定组件相关的DEBUG/TRACE日志。例如,显示RPC消息:

>ozone insight logs om.protocol.client

[OM] 2020-07-28 12:31:49,988 [DEBUG|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] OzoneProtocol ServiceList request is received
[OM] 2020-07-28 12:31:50,095 [DEBUG|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] OzoneProtocol CreateVolume request is received

使用 -v 标志,也可以显示 protobuf 信息的内容(TRACE级别的日志):

ozone insight logs -v om.protocol.client

[OM] 2020-07-28 12:33:28,463 [TRACE|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] [service=OzoneProtocol] [type=CreateVolume] request is received:
cmdType: CreateVolume
traceID: ""
clientId: "client-A31DF5C6ECF2"
createVolumeRequest {
  volumeInfo {
    adminName: "hadoop"
    ownerName: "hadoop"
    volume: "vol1"
    quotaInBytes: 1152921504606846976
    volumeAcls {
      type: USER
      name: "hadoop"
      rights: "200"
      aclScope: ACCESS
    }
    volumeAcls {
      type: GROUP
      name: "users"
      rights: "200"
      aclScope: ACCESS
    }
    creationTime: 1595939608460
    objectID: 0
    updateID: 0
    modificationTime: 0
  }
}

[OM] 2020-07-28 12:33:28,474 [TRACE|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] [service=OzoneProtocol] [type=CreateVolume] request is processed. Response:
cmdType: CreateVolume
traceID: ""
success: false
message: "Volume already exists"
status: VOLUME_ALREADY_EXISTS