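Ozone provides several tools for getting more information about the current state of the cluster.
Ozone supports Prometheus integration natively. All internal metrics (collected by the Hadoop metrics framework) are published under the /prom HTTP endpoint (for example, http://localhost:9876/prom for SCM).
The Prometheus endpoint is enabled by default, but it can be turned off with the hdds.prometheus.endpoint.enabled configuration variable.
In a secure environment the page is protected by SPNEGO authentication, which Prometheus does not support. To enable monitoring in a secure environment, a dedicated authentication token can be configured.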
Example ozone-site.xml configuration:
<property>
<name>hdds.prometheus.endpoint.token</name>
<value>putyourtokenhere</value>
</property>
Example Prometheus configuration:
scrape_configs:
  - job_name: ozone
    bearer_token: <putyourtokenhere>
    metrics_path: /prom
    static_configs:
      - targets:
          - "127.0.0.1:9876"
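As a rough illustration of what a scraper does with this configuration, the following Python sketch requests /prom with the token in an Authorization: Bearer header (the header form Prometheus uses for bearer_token) and parses the text exposition format into a dictionary. The function names, the default URL and the simplified parser are assumptions made for illustration; they are not part of Ozone.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Simplification: the metric name together with its label set is used
    as the dictionary key, and optional timestamps are not handled.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        if not series:
            continue
        try:
            metrics[series] = float(value)
        except ValueError:
            continue
    return metrics


def fetch_metrics(url="http://localhost:9876/prom", token=None):
    """Fetch and parse metrics; the URL is the SCM example from the text."""
    request = urllib.request.Request(url)
    if token:
        # Same token as hdds.prometheus.endpoint.token in ozone-site.xml.
        request.add_header("Authorization", "Bearer " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))
```

Calling fetch_metrics(token="putyourtokenhere") against a running SCM would return the current metric values; the parser can also be applied on its own to a saved copy of the /prom output.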
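Distributed tracing helps identify performance bottlenecks by visualizing end-to-end performance.
Ozone collects traces with the Jaeger tracing library, which can send trace data to any compatible backend (Zipkin, …).
Tracing is turned off by default. It can be turned on with the hdds.tracing.enabled configuration variable in ozone-site.xml.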
<property>
<name>hdds.tracing.enabled</name>
<value>true</value>
</property>
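To check which value such a property has in a given file, a Hadoop-style configuration file can be read with a few lines of Python. This is a minimal sketch: the helper name read_properties and the inline SAMPLE document are hypothetical, and the snippet assumes the usual layout in which the property elements sit inside a configuration root element.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Hypothetical ozone-site.xml content used only for this example.
SAMPLE = """
<configuration>
  <property>
    <name>hdds.tracing.enabled</name>
    <value>true</value>
  </property>
</configuration>
"""

props = read_properties(SAMPLE)
# Tracing is off by default, so "false" is used when the key is absent.
print(props.get("hdds.tracing.enabled", "false"))
```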
The Jaeger client can be configured with environment variables, as described in the Jaeger client documentation.
For example:
JAEGER_SAMPLER_PARAM=0.01
JAEGER_SAMPLER_TYPE=probabilistic
JAEGER_AGENT_HOST=jaeger
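This configuration records 1% of the requests to limit the performance overhead. For more information about Jaeger sampling, see the documentation.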
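The idea behind a probabilistic sampler with parameter 0.01 can be sketched in Python as follows. This is a simplified illustration and not the Jaeger client's implementation: it assumes unsigned 64-bit trace IDs drawn uniformly at random and keeps a trace when its ID falls below a threshold proportional to the sampling rate, so the decision is the same wherever the trace ID is seen. The names is_sampled and MAX_TRACE_ID are hypothetical.

```python
import random

MAX_TRACE_ID = 2 ** 64  # assumption: unsigned 64-bit trace IDs


def is_sampled(trace_id, rate):
    """Keep the trace if its ID falls in the lowest `rate` share of the ID space."""
    return trace_id < rate * MAX_TRACE_ID


# Simulate 100,000 requests with random trace IDs and a 1% sampling rate.
rng = random.Random(42)
ids = [rng.getrandbits(64) for _ in range(100_000)]
kept = sum(is_sampled(i, 0.01) for i in ids)
print(f"kept {kept} of {len(ids)} traces")
```

With a rate of 0.01 roughly one request in a hundred is traced, which is why the overhead stays small.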
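Ozone Insight is a tool for inspecting the current state of an Ozone cluster. It can show the logging, metrics and configuration of a particular component.
Use the ozone insight list command to check the available components: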
> ozone insight list
Available insight points:
  scm.node-manager                     SCM Datanode management related information.
  scm.replica-manager                  SCM closed container replication manager
  scm.event-queue                      Information about the internal async event delivery
  scm.protocol.block-location          SCM Block location protocol endpoint
  scm.protocol.container-location      SCM Container location protocol endpoint
  scm.protocol.security                SCM Block location protocol endpoint
  om.key-manager                       OM Key Manager
  om.protocol.client                   Ozone Manager RPC endpoint
  datanode.pipeline                    More information about one ratis datanode ring.
ozone insight config shows the configuration related to a specific component (supported only for selected components).
> ozone insight config scm.replica-manager
Configuration for `scm.replica-manager` (SCM closed container replication manager)
>>> hdds.scm.replication.thread.interval
default: 300s
current: 300s
There is a replication monitor thread running inside SCM which takes care of replicating the containers in the cluster. This property is used to configure the interval in which that thread runs.
>>> hdds.scm.replication.event.timeout
default: 30m
current: 30m
Timeout for the container replication/deletion commands sent to datanodes. After this timeout the command will be retried.
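Values such as 300s and 30m in the output above carry a unit suffix. When comparing them in a script it helps to normalize them to seconds first. The helper below is a hypothetical sketch that handles only the ms, s, m, h and d suffixes; it is not the parser Ozone itself uses.

```python
import re

# Seconds per unit for the suffixes this sketch understands.
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def duration_to_seconds(text):
    """Convert a value like '300s' or '30m' to a number of seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|m|h|d)\s*", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


print(duration_to_seconds("300s"))  # prints 300
print(duration_to_seconds("30m"))   # prints 1800
```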
ozone insight metrics shows the metrics related to a specific component (supported only for selected components).
> ozone insight metrics scm.protocol.block-location
Metrics for `scm.protocol.block-location` (SCM Block location protocol endpoint)
RPC connections
  Open connections: 0
  Dropped connections: 0
  Received bytes: 1267
  Sent bytes: 2420
RPC queue
  RPC average queue time: 0.0
  RPC call queue length: 0
RPC performance
  RPC processing time average: 0.0
  Number of slow calls: 0
Message type counters
  Number of AllocateScmBlock: ???
  Number of DeleteScmKeyBlocks: ???
  Number of GetScmInfo: ???
  Number of SortDatanodes: ???
ozone insight logs connects to the required service and shows the DEBUG/TRACE logs related to one specific component. For example, to display RPC messages:
> ozone insight logs om.protocol.client
[OM] 2020-07-28 12:31:49,988 [DEBUG|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] OzoneProtocol ServiceList request is received
[OM] 2020-07-28 12:31:50,095 [DEBUG|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] OzoneProtocol CreateVolume request is received
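Log lines in this shape are easy to post-process. The following Python sketch splits one line into its parts with a regular expression. The pattern is derived only from the sample output above, and the group names (in particular origin and logger for the second and third bracketed fields) are guesses for illustration, not an official format description.

```python
import re
from datetime import datetime

# Pattern inferred from the sample lines: [SERVICE] timestamp [LEVEL|a|b] message
LOG_LINE = re.compile(
    r"\[(?P<service>[^\]]+)\] "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<level>[A-Z]+)\|(?P<origin>[^|\]]+)\|(?P<logger>[^\]]+)\] "
    r"(?P<message>.*)"
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    return fields


line = (
    "[OM] 2020-07-28 12:31:49,988 [DEBUG|org.apache.hadoop.ozone.protocolPB."
    "OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] "
    "OzoneProtocol ServiceList request is received"
)
entry = parse_log_line(line)
print(entry["level"], entry["message"])
```

Lines that do not start with a bracketed service name, such as the protobuf bodies printed at TRACE level, yield None and can be attached to the preceding entry.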
With the -v flag, the content of the protobuf messages can also be displayed (TRACE-level logs):
> ozone insight logs -v om.protocol.client
[OM] 2020-07-28 12:33:28,463 [TRACE|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] [service=OzoneProtocol] [type=CreateVolume] request is received:
cmdType: CreateVolume
traceID: ""
clientId: "client-A31DF5C6ECF2"
createVolumeRequest {
  volumeInfo {
    adminName: "hadoop"
    ownerName: "hadoop"
    volume: "vol1"
    quotaInBytes: 1152921504606846976
    volumeAcls {
      type: USER
      name: "hadoop"
      rights: "200"
      aclScope: ACCESS
    }
    volumeAcls {
      type: GROUP
      name: "users"
      rights: "200"
      aclScope: ACCESS
    }
    creationTime: 1595939608460
    objectID: 0
    updateID: 0
    modificationTime: 0
  }
}
[OM] 2020-07-28 12:33:28,474 [TRACE|org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB|OzoneProtocolMessageDispatcher] [service=OzoneProtocol] [type=CreateVolume] request is processed. Response:
cmdType: CreateVolume
traceID: ""
success: false
message: "Volume already exists"
status: VOLUME_ALREADY_EXISTS
Under the hood, ozone insight retrieves the required information over HTTP endpoints (the /conf, /prom and /logLevel endpoints). It is not yet supported in secure environments.