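Introduction
Built with deepseek-v4-pro on local VMware virtual machines, with Kerberos authentication.
Contents
- Environment Preparation
- MySQL Installation
- Kerberos KDC Setup
- ZooKeeper Cluster
- Hadoop Cluster (HA + Kerberos)
- Hive Metastore
- Kafka Cluster (KRaft)
- Spark Configuration
- Flink Configuration
- Kyuubi Configuration
- DolphinScheduler
- HDFS Permission Model
- One-Click Start/Stop Scripts
- Verification Checklist
1. Environment Preparation
1.1 Node Planning
| Node | IP | Roles |
|---|---|---|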
| node1 | 192.168.164.131 | KDC, MySQL, NameNode(Active), ZKFC, Kyuubi, DolphinScheduler Master/API/Worker |
| node2 | 192.168.164.132 | NameNode(Standby), ZKFC, ResourceManager, Hive Metastore, DolphinScheduler Worker |
| node3 | 192.168.164.133 | DataNode, JournalNode, NodeManager, ZooKeeper, Kafka, DolphinScheduler Worker |
| node4 | 192.168.164.134 | DataNode, JournalNode, NodeManager, ZooKeeper, Kafka, DolphinScheduler Worker |
| node5 | 192.168.164.135 | DataNode, JournalNode, NodeManager, ZooKeeper, Kafka, DolphinScheduler Worker |
Note: JournalNodes run on node3, node4, and node5. The Quorum Journal Manager (QJM) commits an edit only after a majority of JournalNodes acknowledge it, which is why an odd number is used. Three JournalNodes are enough for fault tolerance: one can fail and EditLog writes still continue.
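The majority rule behind that statement can be checked with a little arithmetic. The sketch below is only an illustration; the function names `quorum_size` and `tolerated_failures` are hypothetical helpers, not part of Hadoop.

```shell
#!/bin/sh
# Majority-quorum arithmetic for an ensemble of N voting members.
# Hypothetical helper names, used here for illustration only.

# Smallest number of members that forms a majority.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while a majority is still available.
tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 3 4 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Going from three to four members does not raise the number of tolerated failures, which is the usual argument for odd-sized ensembles.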
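1.2 Hostname and hosts Configuration
Purpose: Every node in the cluster must be able to reach the others by hostname, which is a basic requirement of Hadoop, Kerberos, Kafka, and the other components. If a hostname cannot be resolved, RPC communication fails.
Run on all nodes:
bash
# Set the hostname (run the matching command on each node)
sudo hostnamectl set-hostname node1 # on node1
sudo hostnamectl set-hostname node2 # on node2
sudo hostnamectl set-hostname node3 # on node3
sudo hostnamectl set-hostname node4 # on node4
sudo hostnamectl set-hostname node5 # on node5
# On all nodes, append the following entries to /etc/hosts
sudo tee -a /etc/hosts << 'EOF'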
192.168.164.131 node1
192.168.164.132 node2
192.168.164.133 node3
192.168.164.134 node4
192.168.164.135 node5
EOF
Verify hostname resolution:
bash
ping -c 1 node2
ping -c 1 node3
ping -c 1 node4
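ping -c 1 node5
1.3 Passwordless SSH Login
Purpose: Hadoop's start/stop scripts (start-dfs.sh, stop-yarn.sh, and so on) dispatch commands to each node over SSH, so passwordless SSH login is required for them to run without interaction. The cluster distribution script also relies on SSH.
Step 1: Generate an SSH key pair on node1:
bash
# Generate a 4096-bit RSA key (no passphrase, suitable for automation)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N "" -q
Step 2: Distribute the public key to all nodes (including node1 itself):
bash
# Loop over the hosts; StrictHostKeyChecking=no skips the first-connection prompt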
for host in node1 node2 node3 node4 node5; do
ssh-copy-id -o StrictHostKeyChecking=no causes@$host
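done
Note: You will be prompted for the login password of the causes user on each node. If the passwords differ between nodes, unify them beforehand or run the command one host at a time.
Step 3: Verify passwordless login: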
bash
for host in node1 node2 node3 node4 node5; do
ssh -o StrictHostKeyChecking=no $host "hostname"
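done
Expected output: the hostnames node1 through node5 are printed in order, with no password prompt.
1.4 JDK Verification and Environment Variables
Note: JDK 8 (8u492) and JDK 17 (17.0.19) are already unpacked to /opt/module/jdk8 and /opt/module/jdk17. This cluster mainly uses JDK 8 (Hadoop, Hive, Spark, Flink, and Kafka are all compatible with JDK 8); JDK 17 is an optional alternative.
Step 1: Verify the JDK installation: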
bash
/opt/module/jdk8/bin/java -version
/opt/module/jdk17/bin/java -version
Expected output:
# JDK 8
java version "1.8.0_492"
Java(TM) SE Runtime Environment (build 1.8.0_492-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.492-b08, mixed mode)
# JDK 17
java version "17.0.19" 2025-04-15 LTS
Java(TM) SE Runtime Environment (build 17.0.19+8-LTS-549)
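Java HotSpot(TM) 64-Bit Server VM (build 17.0.19+8-LTS-549, mixed mode, sharing)
Step 2: Create the global environment variable file (run on node1; Step 3 copies it to the other nodes):
/etc/profile.d/custom_profile.sh is loaded automatically by every login shell, so there is no need to source .bashrc manually.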
bash
sudo tee /etc/profile.d/custom_profile.sh << 'EOF'
# ============================================================
# Global environment variables for the big data platform
# ============================================================
# --- JDK ---
export JAVA_HOME=/opt/module/jdk8
export JDK17_HOME=/opt/module/jdk17
export PATH=$JAVA_HOME/bin:$JDK17_HOME/bin:$PATH
# --- Component home directories ---
export HADOOP_HOME=/opt/module/hadoop
export HIVE_HOME=/opt/module/hive
export ZOOKEEPER_HOME=/opt/module/zookeeper
export KAFKA_HOME=/opt/module/kafka
export SPARK_HOME=/opt/module/spark
export FLINK_HOME=/opt/module/flink
export KYUUBI_HOME=/opt/module/kyuubi
# --- PATH extensions ---
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export PATH=$HIVE_HOME/bin:$PATH
export PATH=$ZOOKEEPER_HOME/bin:$PATH
export PATH=$KAFKA_HOME/bin:$PATH
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PATH=$FLINK_HOME/bin:$PATH
export PATH=$KYUUBI_HOME/bin:$PATH
# --- Temporary directory ---
export TMPDIR=/tmp
# --- Kerberos ticket cache ---
# The %{uid} token is a krb5.conf expansion; in a shell variable use $(id -u)
# so the path matches default_ccache_name in /etc/krb5.conf
export KRB5CCNAME=FILE:/tmp/krb5cc_$(id -u)
EOF
# Apply immediately
source /etc/profile.d/custom_profile.sh
Step 3: Distribute the environment file to all nodes and verify:
bash
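for host in node2 node3 node4 node5; do
  # Copy as the causes user (sudo scp would use root's SSH keys), then install with sudo on the remote side
  scp /etc/profile.d/custom_profile.sh $host:/tmp/custom_profile.sh
  ssh -t $host "sudo install -m 644 /tmp/custom_profile.sh /etc/profile.d/custom_profile.sh"
  ssh $host "source /etc/profile.d/custom_profile.sh && java -version 2>&1 | head -1"
done
Expected: every node prints the JDK 8 version.
1.5 Unified Directory Layout
Purpose: Keeping the data and log directory layout identical on every node simplifies operations and script automation.
Run on all nodes: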
bash
# Create the directories (sudo is used because /opt may belong to root)
sudo mkdir -p /opt/data/{hadoop,hive,kafka,zookeeper,spark,flink,kyuubi,mysql}
sudo mkdir -p /opt/config/{hadoop,hive,kafka,zookeeper,spark,flink,kyuubi,kerberos}
sudo mkdir -p /opt/script
sudo mkdir -p /opt/module
sudo mkdir -p /opt/software
# Give ownership to the causes user so later steps do not need sudo all the time
sudo chown -R causes:causes /opt/data /opt/config /opt/script /opt/module /opt/software
sudo chmod -R 755 /opt/data /opt/config /opt/script /opt/module /opt/software
1.6 Cluster Distribution Script (xsync)
Purpose: A general-purpose distribution script that uses scp to copy a file or directory to every remote node. It is one of the most frequently used tools in cluster operations.
Create /opt/script/xsync.sh on node1 only:
bash
cat > /opt/script/xsync.sh << 'SCRIPT_EOF'
#!/bin/bash
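# ================================================================
# Cluster file distribution script: xsync
# Usage: xsync.sh <path of the file or directory to distribute>
# Examples: xsync.sh /opt/module/hadoop/etc/hadoop/core-site.xml
#           xsync.sh /opt/module/hive/conf/hive-site.xml
# How it works: uses scp to copy the file/directory to the same path on every remote node
# ================================================================
if [ $# -lt 1 ]; then
  echo "Usage: $0 <file or directory to distribute>"
  echo "Example: $0 /opt/module/hadoop/etc/hadoop/core-site.xml"
  exit 1
fi
# Resolve to an absolute path so the remote copy lands at the same location
TARGET=$(readlink -f "$1")
TARGET_PARENT=$(dirname "$TARGET")
# Remote node list (node1, the current node, is excluded)
SERVERS="node2 node3 node4 node5"
for server in $SERVERS; do
  echo "========== Syncing to $server =========="
  # Make sure the parent directory exists on the remote node
  ssh "$server" "mkdir -p '$TARGET_PARENT'"
  # Copy the file or directory recursively
  if scp -r "$TARGET" "$server:$TARGET_PARENT/"; then
    echo " [$server] sync succeeded ✓"
  else
    echo " [$server] sync failed ✗"
  fi
done
echo "========== Distribution finished =========="
SCRIPT_EOF
chmod +x /opt/script/xsync.sh
Note: This script is used only on the management node node1 and does not need to be distributed to the other nodes.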
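2. MySQL Installation
2.1 MySQL Architecture
Note: MySQL 8.0 runs in a Docker container on node1. Deploying it with Docker has these advantages:
- It does not pollute the host environment, and upgrades and removal are easy
- --restart=always brings the container back automatically after a host reboot
- Data is persisted through a volume mount, so deleting the container does not affect the data
MySQL provides metadata storage for the following components:
- Hive Metastore: stores Hive table schemas, partition information, column statistics, and so on
- DolphinScheduler: stores workflow definitions, task instances, scheduling state, and so on
2.2 Docker Installation (node1)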
bash
# Install prerequisite packages
sudo apt-get update
sudo apt-get install -y ca-certificates curl
# Add Docker's official GPG key (Ubuntu 24.04)
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the Docker APT repository
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
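# Add the causes user to the docker group (avoids typing sudo docker every time)
sudo usermod -aG docker causes
# Log in again for the group change to take effect, or apply it to the current shell
newgrp docker
2.3 MySQL Container Deployment
Step 1: Create the persistent directories and the custom configuration:
bash
# Create the persistent directories
mkdir -p /opt/data/mysql/{data,conf,logs}
# Create the custom MySQL configuration file (limits memory use to suit 8GB nodes)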
cat > /opt/data/mysql/conf/my.cnf << 'EOF'
[mysqld]
# Character set
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
# Authentication plugin (compatible with older clients)
default_authentication_plugin=mysql_native_password
# Maximum connections
max_connections=500
# Maximum packet size
max_allowed_packet=256M
# InnoDB buffer pool (limited to 256MB for small-memory nodes)
innodb_buffer_pool_size=256M
innodb_log_file_size=128M
# Allow creating stored functions without SUPER while binary logging is enabled
log_bin_trust_function_creators=1
# SQL mode
sql_mode=STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
EOF
Step 2: Start the MySQL container:
bash
docker run -d \
--name mysql8 \
--restart=always \
-p 3306:3306 \
-v /opt/data/mysql/data:/var/lib/mysql \
-v /opt/data/mysql/conf/my.cnf:/etc/mysql/conf.d/my.cnf \
-v /opt/data/mysql/logs:/var/log/mysql \
-e MYSQL_ROOT_PASSWORD='Root@2026!' \
mysql:8.0.41
Step 3: Wait for MySQL to become ready:
bash
# Check that the container is running
docker ps | grep mysql8
# Poll until MySQL accepts connections
until docker exec mysql8 mysqladmin ping -u root -p'Root@2026!' --silent; do
  echo "Waiting for MySQL to start..."
  sleep 2
done
echo "MySQL is ready to accept connections"
Expected output:
CONTAINER ID IMAGE STATUS PORTS NAMES
a1b2c3d4e5f6 mysql:8.0.41 Up 15 seconds 0.0.0.0:3306->3306/tcp, 33060/tcp mysql8
MySQL is ready to accept connections
2.4 Creating Databases and Users
Purpose: Create a separate database and a dedicated user for Hive Metastore and for DolphinScheduler. This follows the principle of least privilege: each component uses its own account and cannot interfere with the other.
bash
docker exec -i mysql8 mysql -u root -p'Root@2026!' << 'SQL_EOF'
-- ============================================================
-- 1. Hive Metastore database
-- ============================================================
CREATE DATABASE IF NOT EXISTS hive_metastore
DEFAULT CHARACTER SET utf8mb4
DEFAULT COLLATE utf8mb4_unicode_ci;
CREATE USER IF NOT EXISTS 'hive'@'%' IDENTIFIED BY 'Hive@2026!';
GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hive'@'%';
-- ============================================================
-- 2. DolphinScheduler database
-- ============================================================
CREATE DATABASE IF NOT EXISTS dolphinscheduler
DEFAULT CHARACTER SET utf8mb4
DEFAULT COLLATE utf8mb4_unicode_ci;
CREATE USER IF NOT EXISTS 'ds'@'%' IDENTIFIED BY 'Dolphin@2026!';
GRANT ALL PRIVILEGES ON dolphinscheduler.* TO 'ds'@'%';
-- ============================================================
-- 3. Reload privileges
-- ============================================================
FLUSH PRIVILEGES;
-- Verify the result
SHOW DATABASES;
SELECT user, host FROM mysql.user WHERE user IN ('hive', 'ds');
SQL_EOF
The expected output includes the two databases hive_metastore and dolphinscheduler, and the two users hive and ds.
Remote connection check (from node2 or any other node):
bash
# Install the MySQL client on node2
sudo apt-get install -y mysql-client
# Test a remote connection as the hive user
mysql -h node1 -P 3306 -u hive -p'Hive@2026!' -e "SELECT VERSION();"
# Expected output:
# +-----------+
# | VERSION() |
# +-----------+
# | 8.0.41 |
# +-----------+
3. Kerberos KDC Setup
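3.1 Kerberos Concepts
Kerberos is a ticket-based network authentication protocol that provides unified identity authentication for the whole big data platform.
Core concepts:
- Realm: the authentication boundary; this cluster uses BIGDATA.CLUSTER
- KDC (Key Distribution Center): includes the AS (Authentication Server) and the TGS (Ticket Granting Server)
- Principal: an identity in Kerberos, in the form service/hostname@REALM or username@REALM
- Keytab: a file that holds a principal's encryption keys, used by services to authenticate without interaction
- Ticket: a temporary credential the client obtains after authenticating; it expires after a set lifetime
Authentication flow in brief:
- The client runs kinit to obtain a TGT (Ticket Granting Ticket)
- The client presents the TGT to the KDC to request a Service Ticket for a specific service
- The client sends the Service Ticket to the target service to authenticate
3.2 Installing the Kerberos Packages
Install the required packages on all nodes: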
bash
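# DEBIAN_FRONTEND is passed on the sudo command line because sudo and ssh do not forward an exported variable
# On node1: install the KDC server and the admin tools
sudo apt-get update
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y krb5-kdc krb5-admin-server krb5-config krb5-user
# On node2/3/4/5: install only the client packages
for host in node2 node3 node4 node5; do
  ssh $host "sudo apt-get update && sudo DEBIAN_FRONTEND=noninteractive apt-get install -y krb5-user libpam-krb5"
done
Note: If krb5-kdc is installed interactively, an ncurses configuration screen may ask for:
- Default Kerberos version 5 realm: BIGDATA.CLUSTER
- Kerberos servers for your realm: node1
- Administrative server for your Kerberos realm: node1
If the prompts were skipped or answered incorrectly, the values can be corrected later by editing /etc/krb5.conf directly.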
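3.3 krb5.conf: Core Client Configuration
/etc/krb5.conf is the core configuration file for every Kerberos client (including the KDC itself). It defines:
- The default realm
- The addresses of the KDC and the Admin Server
- The mapping between domain names and realms
- Ticket lifetimes
Create it on node1; it is distributed to all nodes later: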
bash
sudo tee /etc/krb5.conf << 'EOF'
[libdefaults]
default_realm = BIGDATA.CLUSTER
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
rdns = false
pkinit_anchors = FILE:/etc/ssl/certs/ca-certificates.crt
default_ccache_name = FILE:/tmp/krb5cc_%{uid}
# Force TCP (avoids UDP fragmentation problems with large Kerberos messages)
udp_preference_limit = 1
[realms]
BIGDATA.CLUSTER = {
kdc = node1:88
admin_server = node1:749
default_domain = bigdata.cluster
}
[domain_realm]
.bigdata.cluster = BIGDATA.CLUSTER
bigdata.cluster = BIGDATA.CLUSTER
node1 = BIGDATA.CLUSTER
node2 = BIGDATA.CLUSTER
node3 = BIGDATA.CLUSTER
node4 = BIGDATA.CLUSTER
node5 = BIGDATA.CLUSTER
[logging]
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmin.log
default = FILE:/var/log/krb5lib.log
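EOF
Key configuration points:
- ticket_lifetime=24h: a TGT is valid for 24 hours, which suits day-to-day operations
- renew_lifetime=7d: a ticket can be renewed for at most 7 days; after that, kinit must be run again
- udp_preference_limit=1: forces TCP for traffic to the KDC. Kerberos messages that carry large tickets can exceed what fits in a single UDP datagram, so TCP is more reliable
3.4 kdc.conf: KDC Service Configuration (node1 only)
/etc/krb5kdc/kdc.conf is the KDC service's own configuration file. It defines:
- The database type and storage path
- The maximum ticket lifetime
- The supported encryption algorithms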
bash
sudo tee /etc/krb5kdc/kdc.conf << 'EOF'
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
BIGDATA.CLUSTER = {
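# Encryption type of the database master key
master_key_type = aes256-cts-hmac-sha384-192
# ACL file path
acl_file = /etc/krb5kdc/kadm5.acl
# Keytab used by the Admin Server
admin_keytab = /etc/krb5kdc/kadm5.keytab
# Supported encryption types (strongest first)
supported_enctypes = aes256-cts-hmac-sha384-192:normal aes128-cts-hmac-sha256-128:normal aes256-cts-hmac-sha1-96:normal aes128-cts-hmac-sha1-96:normal
# Maximum ticket lifetime
max_life = 24h 0m 0s
max_renewable_life = 7d 0h 0m 0s
# Path of the principal database file
database_name = /var/lib/krb5kdc/principal
# Require pre-authentication by default
default_principal_flags = +preauth
}
EOF
Encryption type notes:
aes256-cts-hmac-sha384-192 (RFC 8009) is the strongest encryption type in this list, but it is also the newest and the least widely supported. Java 8u161 and later enable AES-256 without extra policy files, which covers aes256-cts-hmac-sha1-96. Whether a given JDK 8 build also accepts the SHA-2 types should be verified before relying on them; JDK 17 supports them natively. The two sha1-96 entries stay in the list as the broadly compatible options.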
3.5 kadm5.acl: Administrator Access Control (node1 only)
bash
sudo tee /etc/krb5kdc/kadm5.acl << 'EOF'
# Format: principal permission-mask
# * grants all permissions (add/del/mod/get)
causes/admin@BIGDATA.CLUSTER *
*/admin@BIGDATA.CLUSTER *
EOF
3.6 Initializing the KDC Database (node1 only)
bash
# Create the Kerberos database (sets the database master key password)
sudo /usr/sbin/krb5_newrealm
# Interactive prompts:
# Enter KDC database master key: Causes@2026 <-- set this password and remember it
# Re-enter KDC database master key: Causes@2026
# Start the KDC and the Admin Server and enable them at boot
sudo systemctl enable krb5-kdc krb5-admin-server
sudo systemctl start krb5-kdc krb5-admin-server
sudo systemctl status krb5-kdc krb5-admin-server
Expected output:
● krb5-kdc.service - Kerberos 5 Key Distribution Center
Loaded: loaded (/lib/systemd/system/krb5-kdc.service; enabled)
Active: active (running) since ...; ...
● krb5-admin-server.service - Kerberos 5 Admin Server
Loaded: loaded (/lib/systemd/system/krb5-admin-server.service; enabled)
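Active: active (running) since ...; ...
3.7 Creating Kerberos Principals
Purpose: Create a Kerberos principal for every service and user in the cluster. Principals are the foundation of authentication.
Principal naming conventions:
- Service principal: service/hostname@REALM (e.g. nn/node1@BIGDATA.CLUSTER)
- User principal: username@REALM (e.g. hdfs@BIGDATA.CLUSTER)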
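The two naming forms can be taken apart with plain shell parameter expansion. This is a minimal sketch for illustration; `parse_principal` is a hypothetical helper (not a Kerberos tool) and it assumes the principal always includes the @REALM suffix.

```shell
#!/bin/sh
# Split a Kerberos principal into primary, instance, and realm.
# Hypothetical helper; assumes the input always contains "@REALM".
parse_principal() {
  principal=$1
  realm=${principal##*@}      # text after the last "@"
  name=${principal%@*}        # text before the last "@"
  primary=${name%%/*}         # first component (service or user name)
  case "$name" in
    */*) instance=${name#*/} ;;   # second component (hostname), if present
    *)   instance="" ;;
  esac
  echo "primary=$primary instance=$instance realm=$realm"
}

parse_principal "nn/node1@BIGDATA.CLUSTER"
parse_principal "hdfs@BIGDATA.CLUSTER"
```

A service principal yields a non-empty instance (the hostname), while a user principal leaves it empty.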
On the KDC server (node1), create all principals with kadmin.local:
bash
sudo kadmin.local << 'KADMIN_EOF'
# ============================================================
# 1. Administrator principal
# ============================================================
addprinc -pw Causes@2026 causes/admin
# ============================================================
# 2. HDFS service principals
# (service principals use -randkey only; a password would be ignored)
# ============================================================
# NameNode principals (node1, node2)
addprinc -randkey nn/node1@BIGDATA.CLUSTER
addprinc -randkey nn/node2@BIGDATA.CLUSTER
# DataNode principals (node3, node4, node5)
addprinc -randkey dn/node3@BIGDATA.CLUSTER
addprinc -randkey dn/node4@BIGDATA.CLUSTER
addprinc -randkey dn/node5@BIGDATA.CLUSTER
# JournalNode principals (node3, node4, node5)
addprinc -randkey jn/node3@BIGDATA.CLUSTER
addprinc -randkey jn/node4@BIGDATA.CLUSTER
addprinc -randkey jn/node5@BIGDATA.CLUSTER
# HTTP service principals (SPNEGO web authentication)
addprinc -randkey HTTP/node1@BIGDATA.CLUSTER
addprinc -randkey HTTP/node2@BIGDATA.CLUSTER
addprinc -randkey HTTP/node3@BIGDATA.CLUSTER
addprinc -randkey HTTP/node4@BIGDATA.CLUSTER
addprinc -randkey HTTP/node5@BIGDATA.CLUSTER
# ============================================================
# 3. YARN service principals
# ============================================================
# ResourceManager (node2)
addprinc -randkey rm/node2@BIGDATA.CLUSTER
# NodeManager (node3, node4, node5)
addprinc -randkey nm/node3@BIGDATA.CLUSTER
addprinc -randkey nm/node4@BIGDATA.CLUSTER
addprinc -randkey nm/node5@BIGDATA.CLUSTER
# ============================================================
# 4. Hive service principal (node2)
# ============================================================
addprinc -randkey hive/node2@BIGDATA.CLUSTER
# ============================================================
# 5. Kafka service principals (node3, node4, node5)
# ============================================================
addprinc -randkey kafka/node3@BIGDATA.CLUSTER
addprinc -randkey kafka/node4@BIGDATA.CLUSTER
addprinc -randkey kafka/node5@BIGDATA.CLUSTER
# ============================================================
# 6. User principals
# ============================================================
addprinc -pw causes123 hdfs
addprinc -pw causes123 yarn
addprinc -pw causes123 hive
addprinc -pw causes123 spark
addprinc -pw causes123 flink
addprinc -pw causes123 kafka
addprinc -pw causes123 kyubiuser
addprinc -pw causes123 causes@BIGDATA.CLUSTER
KADMIN_EOF
Verify that the principals were created:
bash
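sudo kadmin.local -q "listprincs"
The expected output contains every principal created above, for example nn/node1@BIGDATA.CLUSTER and hdfs@BIGDATA.CLUSTER.
3.8 Generating Keytab Files
Purpose: A keytab is the service-side "password file". A service process reads its keytab at startup to obtain Kerberos tickets, so nobody has to type a password. This is what makes unattended operation possible in production.
bash
# Create the keytab directory
sudo mkdir -p /opt/config/kerberos/keytabs
sudo chown -R causes:causes /opt/config/kerberos
# Generate the keytabs in bulk with kadmin.local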
sudo kadmin.local << 'KADMIN_EOF'
# ============================================================
# HDFS keytabs
# ============================================================
# NameNode keytab (contains the principals for both node1 and node2)
ktadd -k /opt/config/kerberos/keytabs/nn.service.keytab nn/node1@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/nn.service.keytab nn/node2@BIGDATA.CLUSTER
# DataNode keytab (contains the three DataNode principals)
ktadd -k /opt/config/kerberos/keytabs/dn.service.keytab dn/node3@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/dn.service.keytab dn/node4@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/dn.service.keytab dn/node5@BIGDATA.CLUSTER
# JournalNode keytab
ktadd -k /opt/config/kerberos/keytabs/jn.service.keytab jn/node3@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/jn.service.keytab jn/node4@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/jn.service.keytab jn/node5@BIGDATA.CLUSTER
# HTTP service keytab (SPNEGO web authentication)
ktadd -k /opt/config/kerberos/keytabs/HTTP.keytab HTTP/node1@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/HTTP.keytab HTTP/node2@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/HTTP.keytab HTTP/node3@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/HTTP.keytab HTTP/node4@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/HTTP.keytab HTTP/node5@BIGDATA.CLUSTER
# ============================================================
# YARN keytabs
# ============================================================
ktadd -k /opt/config/kerberos/keytabs/rm.service.keytab rm/node2@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/nm.service.keytab nm/node3@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/nm.service.keytab nm/node4@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/nm.service.keytab nm/node5@BIGDATA.CLUSTER
# ============================================================
# Hive keytab
# ============================================================
ktadd -k /opt/config/kerberos/keytabs/hive.service.keytab hive/node2@BIGDATA.CLUSTER
# ============================================================
# Kafka keytab
# ============================================================
ktadd -k /opt/config/kerberos/keytabs/kafka.service.keytab kafka/node3@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/kafka.service.keytab kafka/node4@BIGDATA.CLUSTER
ktadd -k /opt/config/kerberos/keytabs/kafka.service.keytab kafka/node5@BIGDATA.CLUSTER
# ============================================================
# User keytabs
# ============================================================
# -norandkey (kadmin.local only) exports the existing keys, so the passwords set with addprinc -pw stay valid
ktadd -norandkey -k /opt/config/kerberos/keytabs/hdfs.keytab hdfs@BIGDATA.CLUSTER
ktadd -norandkey -k /opt/config/kerberos/keytabs/yarn.keytab yarn@BIGDATA.CLUSTER
ktadd -norandkey -k /opt/config/kerberos/keytabs/hive.keytab hive@BIGDATA.CLUSTER
ktadd -norandkey -k /opt/config/kerberos/keytabs/spark.keytab spark@BIGDATA.CLUSTER
ktadd -norandkey -k /opt/config/kerberos/keytabs/flink.keytab flink@BIGDATA.CLUSTER
ktadd -norandkey -k /opt/config/kerberos/keytabs/kafka.keytab kafka@BIGDATA.CLUSTER
ktadd -norandkey -k /opt/config/kerberos/keytabs/kyubiuser.keytab kyubiuser@BIGDATA.CLUSTER
ktadd -norandkey -k /opt/config/kerberos/keytabs/causes.keytab causes@BIGDATA.CLUSTER
KADMIN_EOF
# Restrict keytab permissions (owner read/write only)
sudo chown -R causes:causes /opt/config/kerberos/keytabs
sudo chmod 600 /opt/config/kerberos/keytabs/*.keytab
Verify the keytab contents (list the principals each one holds):
bash
# List the principals in nn.service.keytab
klist -kt /opt/config/kerberos/keytabs/nn.service.keytab
# Expected: lists nn/node1@BIGDATA.CLUSTER and nn/node2@BIGDATA.CLUSTER
klist -kt /opt/config/kerberos/keytabs/dn.service.keytab
# Expected: lists the three DataNode principals
3.9 Distributing the Kerberos Configuration and Keytabs
Policy: Keytabs are distributed on a least-distribution basis. Each node receives only the service keytabs needed by the services that run on it.
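As a quick reference, the per-node service keytab sets implied by this policy can be written as a lookup. The sketch below is illustrative only: `keytabs_for_host` is a hypothetical helper, and its lists are read off the role assignments used in this guide (user keytabs, which go to every node, are left out).

```shell
#!/bin/sh
# Which service keytabs each node needs, following the role layout in this guide.
# Hypothetical helper for illustration; user keytabs are not listed here.
keytabs_for_host() {
  case "$1" in
    node1)
      echo "nn.service.keytab HTTP.keytab" ;;
    node2)
      echo "nn.service.keytab rm.service.keytab hive.service.keytab HTTP.keytab" ;;
    node3|node4|node5)
      echo "dn.service.keytab jn.service.keytab nm.service.keytab kafka.service.keytab HTTP.keytab" ;;
    *)
      echo "unknown host: $1" >&2
      return 1 ;;
  esac
}

for host in node1 node2 node3 node4 node5; do
  echo "$host: $(keytabs_for_host "$host")"
done
```

Comparing this list with `ls /opt/config/kerberos/keytabs` on a node is one way to spot a keytab that was copied somewhere it is not needed.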
bash
# -------- Distribute krb5.conf to all nodes and create the keytab directory --------
for host in node2 node3 node4 node5; do
  # Copy as the causes user, then install into /etc with sudo on the remote side
  scp /etc/krb5.conf $host:/tmp/krb5.conf
  ssh -t $host "sudo install -m 644 /tmp/krb5.conf /etc/krb5.conf && mkdir -p /opt/config/kerberos/keytabs"
done
# -------- HDFS keytabs --------
# NameNode keytab -> node2 (already present on node1)
scp /opt/config/kerberos/keytabs/nn.service.keytab node2:/opt/config/kerberos/keytabs/
# DataNode keytab -> node3, node4, node5
for host in node3 node4 node5; do
  scp /opt/config/kerberos/keytabs/dn.service.keytab $host:/opt/config/kerberos/keytabs/
done
# JournalNode keytab -> node3, node4, node5
for host in node3 node4 node5; do
  scp /opt/config/kerberos/keytabs/jn.service.keytab $host:/opt/config/kerberos/keytabs/
done
# HTTP keytab -> all nodes (every node needs SPNEGO)
for host in node2 node3 node4 node5; do
  scp /opt/config/kerberos/keytabs/HTTP.keytab $host:/opt/config/kerberos/keytabs/
done
# -------- YARN keytabs --------
# RM keytab -> node2
scp /opt/config/kerberos/keytabs/rm.service.keytab node2:/opt/config/kerberos/keytabs/
# NM keytab -> node3, node4, node5
for host in node3 node4 node5; do
  scp /opt/config/kerberos/keytabs/nm.service.keytab $host:/opt/config/kerberos/keytabs/
done
# -------- Hive keytab -> node2 --------
scp /opt/config/kerberos/keytabs/hive.service.keytab node2:/opt/config/kerberos/keytabs/
# -------- Kafka keytab -> node3, node4, node5 --------
for host in node3 node4 node5; do
  scp /opt/config/kerberos/keytabs/kafka.service.keytab $host:/opt/config/kerberos/keytabs/
done
# -------- User keytabs -> all nodes (so commands can be run from any node) --------
for host in node2 node3 node4 node5; do
  scp /opt/config/kerberos/keytabs/{hdfs,yarn,hive,spark,flink,kafka,kyubiuser,causes}.keytab $host:/opt/config/kerberos/keytabs/
done
# -------- Set keytab permissions on every remote node --------
for host in node2 node3 node4 node5; do
  ssh $host "sudo chown -R causes:causes /opt/config/kerberos/keytabs && sudo chmod 600 /opt/config/kerberos/keytabs/*.keytab"
done
3.10 Verifying Kerberos Authentication
bash
# Test 1: password authentication
echo "causes123" | kinit causes@BIGDATA.CLUSTER
klist
# Expected: shows the TGT for causes
# Test 2: non-interactive keytab authentication
kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs@BIGDATA.CLUSTER
klist
# Expected: shows the TGT for hdfs
# Test 3: destroy the tickets
kdestroy
klist
# Expected: klist: No credentials cache found
Example of the expected klist output:
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: hdfs@BIGDATA.CLUSTER
Valid starting Expires Service principal
06/18/26 10:00:00 06/19/26 10:00:00 krbtgt/BIGDATA.CLUSTER@BIGDATA.CLUSTER
4. ZooKeeper Cluster
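4.1 ZooKeeper Architecture
Note: ZooKeeper 3.9.5 is deployed as a three-node ensemble on node3/4/5. ZooKeeper plays the following roles in the cluster:
- Hadoop HA: stores the Active/Standby NameNode state and supports automatic failover
- YARN HA (optional): stores ResourceManager state
- Flink HA: stores JobManager metadata
- DolphinScheduler: service registration and discovery
Fault tolerance: a three-node ensemble tolerates the failure of at most one node and still provides quorum.
4.2 Confirming the ZooKeeper Installation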
bash
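# Confirm that ZooKeeper is unpacked to /opt/module/zookeeper
ls /opt/module/zookeeper/
# Expected to contain: bin, conf, lib, docs, licenses, and so on
# If it has not been unpacked yet, run:
# cd /opt/software/
# tar -xzf apache-zookeeper-3.9.5-bin.tar.gz -C /opt/module/
# mv /opt/module/apache-zookeeper-3.9.5-bin /opt/module/zookeeper
4.3 zoo.cfg: Cluster Configuration File
Note: zoo.cfg is ZooKeeper's core configuration file. It defines the tick time, the data directories, the list of ensemble members, and more.
Create it on node3 (it is distributed to node4/5 later):
bash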
cat > /opt/module/zookeeper/conf/zoo.cfg << 'EOF'
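# ============================================================
# ZooKeeper 3.9.5 cluster configuration
# ============================================================
# Base tick time (milliseconds); the unit for all timeout settings
# Minimum session timeout = 2 * tickTime, maximum = 20 * tickTime
tickTime=2000
# Timeout for a Follower's initial connection to the Leader (in ticks)
# Timeout = initLimit * tickTime = 20 seconds
initLimit=10
# Maximum lag allowed between a Follower and the Leader (in ticks)
# Timeout = syncLimit * tickTime = 10 seconds
syncLimit=5
# Snapshot directory
dataDir=/opt/data/zookeeper
# Transaction log directory (kept apart from snapshots)
dataLogDir=/opt/data/zookeeper/logs
# Client port
clientPort=2181
# Maximum connections from a single client IP
maxClientCnxns=100
# Automatic purge of old snapshots (keep the latest 10, check every 4 hours)
autopurge.snapRetainCount=10
autopurge.purgeInterval=4
# ============================================================
# Ensemble members (3-node ensemble)
# Format: server.X=hostname:FollowerPort:ElectionPort[:role]
# FollowerPort: port Followers use to talk to the Leader (2888)
# ElectionPort: Leader election port (3888)
# ============================================================
server.3=node3:2888:3888
server.4=node4:2888:3888
server.5=node5:2888:3888
EOF
Configuration parameters in detail:
- tickTime=2000 (2 seconds): ZooKeeper's basic time unit. The session timeout has a lower bound of 4 seconds (2 * tickTime) and an upper bound of 40 seconds (20 * tickTime)
- initLimit=10: a Follower must complete its initial connection to the Leader within 20 seconds
- syncLimit=5: a Follower that falls more than 10 seconds behind the Leader is dropped from the ensemble
- dataDir and dataLogDir are separate: the transaction log is written sequentially and is latency-sensitive, while snapshots are large periodic writes. Splitting them reduces I/O contention only when the two directories sit on different physical disks; here dataLogDir is a subdirectory of dataDir, so that gain appears only if a separate disk is mounted there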
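The timeout figures quoted above all derive from tickTime. A minimal sketch of the arithmetic, using the values from this zoo.cfg (the variable names are illustrative only):

```shell
#!/bin/sh
# Derive ZooKeeper timeouts from the values configured in zoo.cfg above.
tick_ms=2000
init_limit=10
sync_limit=5

min_session_ms=$(( 2 * tick_ms ))            # lower bound of the session timeout
max_session_ms=$(( 20 * tick_ms ))           # upper bound of the session timeout
init_timeout_ms=$(( init_limit * tick_ms ))  # Follower initial connection timeout
sync_timeout_ms=$(( sync_limit * tick_ms ))  # maximum Follower lag

echo "session timeout range: ${min_session_ms}-${max_session_ms} ms"
echo "initLimit timeout:     ${init_timeout_ms} ms"
echo "syncLimit timeout:     ${sync_timeout_ms} ms"
```

Changing tickTime therefore shifts every one of these limits at once, which is worth remembering before tuning it.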
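4.4 Creating the myid Files
Note: Every ZooKeeper node needs a myid file whose content is that node's unique numeric ID in the ensemble (it must match the X of the corresponding server.X entry in zoo.cfg).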
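One way to avoid a mismatch is to derive the ID from zoo.cfg instead of typing it by hand. The following is a sketch under assumptions: `myid_for_host` is a hypothetical helper, and it is exercised here against a temporary copy of the server.X lines rather than the real file.

```shell
#!/bin/sh
# Look up the numeric ID that zoo.cfg assigns to a hostname.
# Hypothetical helper; expects lines of the form server.X=host:2888:3888.
myid_for_host() {
  cfg=$1
  host=$2
  sed -n "s/^server\.\([0-9][0-9]*\)=${host}:.*/\1/p" "$cfg"
}

# Demonstration against a temporary file holding the entries from this guide.
cfg_file=$(mktemp)
trap 'rm -f "$cfg_file"' EXIT
cat > "$cfg_file" << 'EOF'
server.3=node3:2888:3888
server.4=node4:2888:3888
server.5=node5:2888:3888
EOF

for h in node3 node4 node5; do
  echo "$h -> myid $(myid_for_host "$cfg_file" "$h")"
done
```

On a real node the first argument would be /opt/module/zookeeper/conf/zoo.cfg, and the result could be compared with the content of /opt/data/zookeeper/myid.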
bash
# Run from a node that has passwordless SSH to node3/4/5 (node1 in this guide)
# node3 -> myid 3, node4 -> myid 4, node5 -> myid 5
for id in 3 4 5; do
  ssh node$id "mkdir -p /opt/data/zookeeper/logs && echo $id > /opt/data/zookeeper/myid"
done
# Set the directory owner on every node
for host in node3 node4 node5; do
  ssh $host "chown -R causes:causes /opt/data/zookeeper"
done
4.5 Distributing ZooKeeper to the Other Nodes
bash
# Distribute the ZooKeeper installation directory and configuration file to node4 and node5
for host in node4 node5; do
  scp -r /opt/module/zookeeper $host:/opt/module/
  scp /opt/module/zookeeper/conf/zoo.cfg $host:/opt/module/zookeeper/conf/
done
4.6 Starting the ZooKeeper Cluster
bash
# Start ZooKeeper on the three nodes in turn
for host in node3 node4 node5; do
  ssh $host "/opt/module/zookeeper/bin/zkServer.sh start"
  echo " [$host] started"
done
Expected output on each node:
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
4.7 Verifying the ZooKeeper Cluster State
Step 1: Check the role of each node:
bash
for host in node3 node4 node5; do
echo "========== $host =========="
ssh $host "/opt/module/zookeeper/bin/zkServer.sh status"
done
Expected output: one node reports Mode: leader and the other two report Mode: follower.
Step 2: Client read/write test:
bash
# Lists the root znode, creates /test, reads it back, deletes it, then quits
/opt/module/zookeeper/bin/zkCli.sh -server node3:2181 << 'EOF'
ls /
create /test "hello_zookeeper"
get /test
delete /test
quit
EOF
Expected output:
Created /test
hello_zookeeper
Step 3: Check the Java processes:
bash
for host in node3 node4 node5; do
echo "=== $host ==="
ssh $host "jps | grep QuorumPeer"
done
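# Expected: one QuorumPeerMain process on each node
5. Hadoop Cluster (HA + Kerberos)
5.1 Hadoop HA Architecture
Note: The Hadoop 3.4.0 HA cluster built in this section is the foundation for all the upper-layer components that follow. Its architecture: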
                    ┌─────────────────────┐
                    │  ZooKeeper ensemble │
                    │     node3(2181)     │
                    │     node4(2181)     │
                    │     node5(2181)     │
                    └──────────┬──────────┘
                               │ HA state storage
         ┌─────────────────────┼─────────────────────┐
         │                     │                     │
┌────────▼────────┐   ┌────────▼────────┐   ┌────────▼────────┐
│  ZKFC (node1)   │   │  ZKFC (node2)   │   │   JournalNode   │
│ NameNode(Active)│   │NameNode(Standby)│   │ node3/4/5 (QJM) │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         └─────────────────────┼─────────────────────┘
                               │ shared EditLog storage
         ┌─────────────────────┴─────────────────────┐
         │              DataNode layer               │
         │           node3 / node4 / node5           │
         └───────────────────────────────────────────┘
Key design points:
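- JournalNode (QJM): three JournalNodes form a quorum that provides shared EditLog storage for the two NameNodes. The Active NameNode writes the EditLog to the JournalNodes; the Standby reads it from them and merges it into the FSImage
- ZKFC: one ZKFC process runs alongside each NameNode. It monitors the NameNode's health and drives failover through a ZooKeeper lock
- Fencing: the isolation mechanism used during HA failover to prevent split-brain
Memory plan for 8GB nodes:
- NameNode heap: 1024MB (set explicitly and kept low to fit small-memory nodes)
- DataNode heap: 512MB
- JournalNode heap: 512MB
- Memory available to the NodeManager: 4096MB
- Minimum/maximum size of a single container: 512MB / 3072MB
- Map task memory: 1024MB; reduce task memory: 2048MB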
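A back-of-the-envelope check shows what this plan allows on a single NodeManager. The sketch only divides the figures listed above; it ignores vcores and scheduler overhead, and the variable names are illustrative.

```shell
#!/bin/sh
# Rough container capacity of one NodeManager under the memory plan above.
nm_memory_mb=4096      # memory available to the NodeManager
min_alloc_mb=512       # smallest container
map_mb=1024            # memory per map task
reduce_mb=2048         # memory per reduce task

max_min_containers=$(( nm_memory_mb / min_alloc_mb ))
max_map_tasks=$(( nm_memory_mb / map_mb ))
max_reduce_tasks=$(( nm_memory_mb / reduce_mb ))

echo "minimum-size containers per node: $max_min_containers"
echo "concurrent map tasks per node:    $max_map_tasks"
echo "concurrent reduce tasks per node: $max_reduce_tasks"
```

With three NodeManagers the cluster-wide ceiling is three times these per-node figures, before the ApplicationMaster container is accounted for.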
5.2 Confirming the Hadoop Installation
bash
ls /opt/module/hadoop/
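# Expected to contain: bin, sbin, etc, lib, libexec, share, include, and so on
# If it has not been unpacked:
# cd /opt/software/
# tar -xzf hadoop-3.4.0.tar.gz -C /opt/module/
# mv /opt/module/hadoop-3.4.0 /opt/module/hadoop
5.3 core-site.xml: Core Configuration
core-site.xml holds Hadoop-wide global settings, including:
- The default file system (HDFS nameservice)
- The ZooKeeper address (required for HA)
- Kerberos security and authentication mapping
- HTTPS/SSL communication settings
- Proxy user settings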
bash
cat > /opt/module/hadoop/etc/hadoop/core-site.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- ============================================================
Default file system: refers to the HA cluster by its nameservice ID
============================================================ -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://bigdata-cluster</value>
<description>Default HDFS file system (HA nameservice name)</description>
</property>
<!-- ============================================================
Runtime temporary directory
============================================================ -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/data/hadoop/tmp</value>
<description>Hadoop temporary directory for local intermediate data</description>
</property>
<!-- ============================================================
ZooKeeper ensemble address (required for HA failover)
============================================================ -->
<property>
<name>ha.zookeeper.quorum</name>
<value>node3:2181,node4:2181,node5:2181</value>
<description>ZooKeeper ensemble connection string</description>
</property>
<!-- ============================================================
Enable Kerberos authentication
============================================================ -->
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
<description>Enable Kerberos authentication (allowed values: simple | kerberos)</description>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
<description>Enable service-level authorization (ACLs in hadoop-policy.xml); HDFS file permission checks are controlled separately by dfs.permissions.enabled</description>
</property>
<!-- ============================================================
Kerberos principal -> local user mapping rules
How to read the rules:
RULE:[1:$1@$0] -> applies to one-component principals and strips the realm
(e.g. hdfs@BIGDATA.CLUSTER -> hdfs)
RULE:[2:$1@$0] -> applies to two-component principals and keeps the first component
(e.g. nn/node1@BIGDATA.CLUSTER -> nn)
DEFAULT -> for principals in the default realm, uses the first component
============================================================ -->
<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[1:$1@$0](.*@BIGDATA\.CLUSTER)s/@.*//
RULE:[2:$1@$0](.*@BIGDATA\.CLUSTER)s/@.*//
DEFAULT
</value>
<description>Auth-to-Local: maps a Kerberos principal to a local OS username</description>
</property>
<!-- ============================================================
HTTPS / SSL protocols and settings
============================================================ -->
<property>
<name>hadoop.ssl.enabled.protocols</name>
<value>TLSv1.2,TLSv1.3</value>
</property>
<property>
<name>hadoop.ssl.hostname.verifier</name>
<value>DEFAULT</value>
</property>
<property>
<name>hadoop.ssl.require.client.cert</name>
<value>false</value>
</property>
<property>
<name>hadoop.ssl.keystores.factory.class</name>
<value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
</property>
<!-- ============================================================
Proxy users: allow a superuser to submit work on behalf of other users
Why: services such as Hive and Kyuubi run under their own identity but must access HDFS as the actual end user
============================================================ -->
<property>
<name>hadoop.proxyuser.causes.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.causes.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
<!-- ============================================================
Block access tokens (DataNode access control in secure mode)
============================================================ -->
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<!-- ============================================================
Kerberos principal templates per component (_HOST is replaced with the actual hostname)
============================================================ -->
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>nn/_HOST@BIGDATA.CLUSTER</value>
</property>
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>dn/_HOST@BIGDATA.CLUSTER</value>
</property>
<property>
<name>dfs.journalnode.kerberos.principal</name>
<value>jn/_HOST@BIGDATA.CLUSTER</value>
</property>
</configuration>
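EOF
About the auth_to_local rules: this is one of the settings most often misconfigured in production. Both rules strip the realm from principals in BIGDATA.CLUSTER; for a two-component principal, [2:$1@$0] keeps only the first component, so nn/node1@BIGDATA.CLUSTER maps to nn.
_HOST is replaced with the node's host name at runtime. The mapped local user name must match the OS user that runs the command; otherwise requests fail with errors such as "User xxx doesn't have permission".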
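The string transformation performed by the two rules can be sketched in plain shell. This is only an illustration under the assumption that the realm is BIGDATA.CLUSTER; the function name `map_principal` is hypothetical and the sketch does not read core-site.xml or call Hadoop.

```bash
#!/usr/bin/env bash
# Hypothetical helper: imitates RULE:[n:$1@$0](.*@BIGDATA\.CLUSTER)s/@.*//
# It only demonstrates the principal -> short name transformation.
map_principal() {
  local principal="$1"
  local realm="BIGDATA.CLUSTER"
  # The rules only match principals that belong to this realm
  case "$principal" in
    *@"$realm") ;;
    *) echo "no rule applies to: $principal" >&2; return 1 ;;
  esac
  local name="${principal%@*}"   # drop the "@REALM" suffix
  printf '%s\n' "${name%%/*}"    # keep the first component ($1)
}
```

For example, `map_principal nn/node1@BIGDATA.CLUSTER` prints `nn`. On a configured node, `hadoop kerbname nn/node1@BIGDATA.CLUSTER` should report the mapping Hadoop itself computes, assuming a Hadoop 3.x release that ships the `kerbname` subcommand.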
5.4 hdfs-site.xml: HDFS HA configuration
bash
cat > /opt/module/hadoop/etc/hadoop/hdfs-site.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- ============================================================
1. HA nameservice definition
============================================================ -->
<property>
<name>dfs.nameservices</name>
<value>bigdata-cluster</value>
<description>Logical name of the HA nameservice; clients access HDFS through this name</description>
</property>
<!-- ============================================================
2. NameNode ID list
============================================================ -->
<property>
<name>dfs.ha.namenodes.bigdata-cluster</name>
<value>nn1,nn2</value>
</property>
<!-- ============================================================
3. NameNode RPC addresses (port used for Hadoop internal communication)
============================================================ -->
<property>
<name>dfs.namenode.rpc-address.bigdata-cluster.nn1</name>
<value>node1:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.bigdata-cluster.nn2</name>
<value>node2:8020</value>
</property>
<!-- ============================================================
4. NameNode HTTP web UI addresses
============================================================ -->
<property>
<name>dfs.namenode.http-address.bigdata-cluster.nn1</name>
<value>node1:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.bigdata-cluster.nn2</name>
<value>node2:9870</value>
</property>
<!-- ============================================================
5. NameNode HTTPS web UI addresses (secure access)
============================================================ -->
<property>
<name>dfs.namenode.https-address.bigdata-cluster.nn1</name>
<value>node1:9871</value>
</property>
<property>
<name>dfs.namenode.https-address.bigdata-cluster.nn2</name>
<value>node2:9871</value>
</property>
<!-- ============================================================
6. JournalNode quorum: shared edit log directory
Format: qjournal://host1:port1;host2:port2;.../nameservice_id
============================================================ -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node3:8485;node4:8485;node5:8485/bigdata-cluster</value>
<description>JournalNode cluster address; the NameNodes share the EditLog through this location</description>
</property>
<!-- ============================================================
7. Local directory where each JournalNode stores edit logs
============================================================ -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/opt/data/hadoop/journal</value>
</property>
<!-- ============================================================
8. NameNode metadata directory (fsimage + edits)
============================================================ -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/data/hadoop/name</value>
</property>
<!-- ============================================================
9. DataNode block storage directory
============================================================ -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/data/hadoop/data</value>
</property>
<!-- ============================================================
10. Default replication factor (3 DataNodes; 2 keeps redundancy while saving space)
============================================================ -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<!-- ============================================================
11. Block size (128 MB = 134217728 bytes)
Large blocks suit large files and reduce NameNode metadata overhead
============================================================ -->
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
<!-- ============================================================
12. Automatic HA failover
============================================================ -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
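<description>Enable ZooKeeper-based automatic failover</description>
</property>
<!-- ============================================================
13. Fencing method
This guide uses the simple shell(/bin/true) option, which never actually fences the old Active;
for production, configure sshfence or another stronger method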
============================================================ -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
<description>Fencing method; sshfence is recommended for production</description>
</property>
<!-- ============================================================
14. Client failover proxy provider
Clients use this class to locate the Active NameNode automatically
============================================================ -->
<property>
<name>dfs.client.failover.proxy.provider.bigdata-cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- ============================================================
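15. Protection level for DataNode data transfer (SASL)
authentication: authentication only, no encryption (best performance)
integrity: authentication + integrity check
privacy: authentication + integrity + encryption (most secure, largest performance cost)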
============================================================ -->
<property>
<name>dfs.data.transfer.protection</name>
<value>authentication</value>
</property>
<!-- ============================================================
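16. Web UI security policy
HTTPS_ONLY: the web UI is served over HTTPS only (the HTTP endpoint is not started)
HTTP_ONLY: the web UI is served over HTTP only (not recommended)
HTTP_AND_HTTPS: both are served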
============================================================ -->
<property>
<name>dfs.http.policy</name>
<value>HTTPS_ONLY</value>
</property>
<!-- ============================================================
17. Web authentication (SPNEGO)
SPNEGO = Simple and Protected GSSAPI Negotiation Mechanism
Provides Kerberos authentication when the HDFS web UI is opened in a browser
============================================================ -->
<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>HTTP/_HOST@BIGDATA.CLUSTER</value>
</property>
<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/opt/config/kerberos/keytabs/HTTP.keytab</value>
</property>
<!-- ============================================================
18. Reference to the SSL keystore configuration
============================================================ -->
<property>
<name>dfs.https.server.keystore.resource</name>
<value>ssl-server.xml</value>
<description>Points to the SSL keystore settings in ssl-server.xml</description>
</property>
<property>
<name>dfs.namenode.kerberos.internal.spnego.principal</name>
<value>HTTP/_HOST@BIGDATA.CLUSTER</value>
</property>
</configuration>
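EOF
5.5 SSL certificate generation and configuration
Purpose: the Hadoop HTTPS web UIs are encrypted with SSL/TLS. (Hadoop RPC is protected through SASL and hadoop.rpc.protection, not through these keystores.) Two certificate setups are described below; choose the one that fits your environment: Scenario A suits test environments or sites without a central CA, Scenario B suits production.
Scenario A: self-signed certificates (test environments or no central CA)
Use the Java keytool to generate a separate self-signed certificate for each node. Every node has its own keystore (private key plus self-signed certificate); the truststore collects the public certificates of all nodes and is shared by every node.
Step 1: generate a keystore for each node and export its certificate: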
bash
# Create the SSL configuration directories
mkdir -p /opt/config/hadoop/ssl
mkdir -p /opt/config/hadoop/ssl/certs
# Generate a self-signed keystore for each of the 5 nodes
# -keyalg RSA: use the RSA algorithm
# -keysize 2048: 2048-bit key
# -validity 3650: certificate valid for 10 years
# CN=nodeX.bigdata.cluster: each node uses its own host name as the CN
# SAN lists the host name and IP (set through the -ext option)
for node in node1 node2 node3 node4 node5; do
# Look up the IP for this host name
case $node in
node1) IP=192.168.164.131 ;;
node2) IP=192.168.164.132 ;;
node3) IP=192.168.164.133 ;;
node4) IP=192.168.164.134 ;;
node5) IP=192.168.164.135 ;;
esac
# Write each keystore straight into its per-node directory. Reusing a single
# keystore path would make keytool fail on the second node, because the alias
# hadoop-server would already exist in that file.
mkdir -p /opt/config/hadoop/ssl/${node}
keytool -genkeypair \
-alias hadoop-server \
-keyalg RSA -keysize 2048 \
-keystore /opt/config/hadoop/ssl/${node}/keystore.jks \
-storepass BigData@2026 -keypass BigData@2026 \
-dname "CN=${node}.bigdata.cluster, OU=BigData, O=Cluster, L=Beijing, ST=Beijing, C=CN" \
-validity 3650 \
-ext "SAN=DNS:${node},DNS:${node}.bigdata.cluster,IP:${IP}"
# Export this node's public certificate
keytool -exportcert \
-alias hadoop-server \
-keystore /opt/config/hadoop/ssl/${node}/keystore.jks \
-storepass BigData@2026 \
-file /opt/config/hadoop/ssl/certs/${node}.crt
echo "[OK] ${node} keystore and certificate generated"
done
Note: the loop above runs on node1 and generates the keystores for all 5 nodes under /opt/config/hadoop/ssl/<node>/keystore.jks. Each keystore is later distributed to its own machine from that per-node subdirectory.
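Because the IP addresses in this guide follow the node number, the per-node lookup and the SAN string can also be derived by small helpers. The sketch below assumes the node1..node5 / 192.168.164.131-135 plan of this guide; the function names `node_ip` and `san_ext` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; valid only for the node1..node5 address plan of this guide.
node_ip() {
  local node="$1"
  local idx="${node#node}"           # "node3" -> "3"
  case "$idx" in
    [1-5]) printf '192.168.164.%d\n' "$((130 + idx))" ;;
    *) echo "unknown node: $node" >&2; return 1 ;;
  esac
}

# Builds the value passed to keytool's -ext option for one node.
san_ext() {
  local node="$1" ip
  ip="$(node_ip "$node")" || return 1
  printf 'SAN=DNS:%s,DNS:%s.bigdata.cluster,IP:%s\n' "$node" "$node" "$ip"
}
```

Inside the loop this could be used as `-ext "$(san_ext "$node")"`. After generation, `keytool -printcert -file /opt/config/hadoop/ssl/certs/node1.crt` prints the certificate, including its extensions, so the SAN entries can be checked by eye.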
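Step 2: create the shared truststore and import the certificates of all nodes:
bash
# keytool creates the truststore on the first import,
# so node1's certificate is imported first to initialize it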
keytool -importcert \
-alias node1 \
-keystore /opt/config/hadoop/ssl/truststore.jks \
-storepass BigData@2026 \
-file /opt/config/hadoop/ssl/certs/node1.crt \
-noprompt
# Import the certificates of the remaining nodes one by one
for node in node2 node3 node4 node5; do
keytool -importcert \
-alias ${node} \
-keystore /opt/config/hadoop/ssl/truststore.jks \
-storepass BigData@2026 \
-file /opt/config/hadoop/ssl/certs/${node}.crt \
-noprompt
done
# Verify the truststore contents (5 entries expected)
keytool -list -keystore /opt/config/hadoop/ssl/truststore.jks -storepass BigData@2026
Step 3: create ssl-server.xml (server-side SSL configuration):
bash
cat > /opt/module/hadoop/etc/hadoop/ssl-server.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>ssl.server.keystore.location</name>
<value>/opt/config/hadoop/ssl/keystore.jks</value>
</property>
<property>
<name>ssl.server.keystore.password</name>
<value>BigData@2026</value>
</property>
<property>
<name>ssl.server.keystore.keypassword</name>
<value>BigData@2026</value>
</property>
<property>
<name>ssl.server.keystore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.truststore.location</name>
<value>/opt/config/hadoop/ssl/truststore.jks</value>
</property>
<property>
<name>ssl.server.truststore.password</name>
<value>BigData@2026</value>
</property>
<property>
<name>ssl.server.truststore.type</name>
<value>jks</value>
</property>
</configuration>
EOF
Step 4: create ssl-client.xml (client-side SSL configuration):
bash
cat > /opt/module/hadoop/etc/hadoop/ssl-client.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>ssl.client.truststore.location</name>
<value>/opt/config/hadoop/ssl/truststore.jks</value>
</property>
<property>
<name>ssl.client.truststore.password</name>
<value>BigData@2026</value>
</property>
<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>
</configuration>
EOF
Step 5: distribute the SSL files to each node (every node receives its own keystore plus the shared truststore):
bash
# Distribute each node's own keystore
for node in node2 node3 node4 node5; do
ssh $node "mkdir -p /opt/config/hadoop/ssl"
scp /opt/config/hadoop/ssl/${node}/keystore.jks ${node}:/opt/config/hadoop/ssl/keystore.jks
done
# node1: copy its own keystore from the per-node directory to the standard path
cp /opt/config/hadoop/ssl/node1/keystore.jks /opt/config/hadoop/ssl/keystore.jks
# Distribute the shared truststore and the SSL XML files to all other nodes
for host in node2 node3 node4 node5; do
scp /opt/config/hadoop/ssl/truststore.jks $host:/opt/config/hadoop/ssl/
scp /opt/module/hadoop/etc/hadoop/ssl-{server,client}.xml $host:/opt/module/hadoop/etc/hadoop/
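done
Note: in this setup all nodes share one truststore (holding the public certificates of all 5 nodes), while each node has its own keystore. When the certificates expire, regenerate and redistribute them.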
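Before distributing the truststore it can be worth confirming that it really holds one entry per node. The sketch below counts `trustedCertEntry` lines in `keytool -list` output read from standard input; the function names are hypothetical, and the assumption is that keytool prints one such line per imported certificate.

```bash
#!/usr/bin/env bash
# Hypothetical helper: counts trusted certificate entries in `keytool -list` output (stdin).
count_trusted_entries() {
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
  grep -c 'trustedCertEntry' || true
}

# Succeeds only when the number of entries on stdin equals the expected count.
truststore_has_all() {
  local expected="$1" found
  found="$(count_trusted_entries)"
  [ "$found" -eq "$expected" ]
}
```

A possible call is `keytool -list -keystore /opt/config/hadoop/ssl/truststore.jks -storepass BigData@2026 | truststore_has_all 5 && echo "truststore complete"`.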
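Scenario B: certificates issued by a company-wide CA (production standard)
Certificates issued by the company CA carry a proper certificate chain, so browsers and clients trust them without extra imports (provided the client already trusts the company CA root certificate). The company CA usually supplies the following:
- CA root certificate: ca.crt or ca.pem (used to build the truststore)
- Per-node service certificate and private key: usually PKCS12 (.p12/.pfx), or separate .crt and .key files
- Certificate SAN: the certificate normally already carries the correct SAN (Subject Alternative Name) with host name and IP
Prerequisites:
Supplied by the company CA:
ca.crt # CA root certificate
node1.p12 (or node1.crt+node1.key) # service certificate for node1
node2.p12 (or node2.crt+node2.key) # service certificate for node2
node3.p12 (or node3.crt+node3.key) # service certificate for node3
node4.p12 (or node4.crt+node4.key) # service certificate for node4
node5.p12 (or node5.crt+node5.key) # service certificate for node5
Step 1: create the SSL directory and import the CA root certificate to build the truststore: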
bash
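# Create the directories
mkdir -p /opt/config/ssl
mkdir -p /opt/config/ssl/certs
# Place the ca.crt supplied by the company in /opt/config/ssl/certs/
# Import the CA root certificate into the truststore (shared by all nodes)
keytool -import -trustcacerts -alias ca -file /opt/config/ssl/certs/ca.crt \
-keystore /opt/config/ssl/truststore -storepass BigData@2026 -noprompt
Step 2: import the service certificate for each node (run on the node itself, or process them one by one on node1 and distribute):
The steps below convert each node's certificate into a JKS keystore. There are two cases, depending on the format the company supplies:
Case 2a: the company supplies PKCS12 certificates (.p12 / .pfx):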
bash
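# Convert the PKCS12 certificate supplied by the company into a JKS keystore
# Replace <password-from-company> with the real password of the certificate
# -destkeypass sets the key password to the value that ssl-server.xml expects
keytool -importkeystore \
-srckeystore node1.p12 -srcstoretype PKCS12 -srcstorepass '<password-from-company>' \
-destkeystore /opt/config/ssl/keystore -deststoretype JKS -deststorepass BigData@2026 -destkeypass BigData@2026
# Verify the keystore contents
keytool -list -keystore /opt/config/ssl/keystore -storepass BigData@2026
Case 2b: the company supplies separate crt + key files: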
bash
# First combine crt + key into a PKCS12 file with openssl
openssl pkcs12 -export -in node1.crt -inkey node1.key \
-out /tmp/node1.p12 -password pass:temp123
# Then convert the PKCS12 file to JKS
# -destkeypass sets the key password to the value that ssl-server.xml expects
keytool -importkeystore \
-srckeystore /tmp/node1.p12 -srcstoretype PKCS12 -srcstorepass temp123 \
-destkeystore /opt/config/ssl/keystore -deststoretype JKS -deststorepass BigData@2026 -destkeypass BigData@2026
# Remove the temporary file
rm -f /tmp/node1.p12
Batch processing reference (generate the keystores for all nodes on node1, then distribute):
bash
# Assumes the company supplies PKCS12 files and every node uses the same password
for node in node1 node2 node3 node4 node5; do
mkdir -p /opt/config/ssl/${node}
keytool -importkeystore \
-srckeystore ${node}.p12 -srcstoretype PKCS12 -srcstorepass '<password-from-company>' \
-destkeystore /opt/config/ssl/${node}/keystore -deststoretype JKS -deststorepass BigData@2026 -destkeypass BigData@2026
done
Step 3: create ssl-server.xml:
bash
cat > /opt/module/hadoop/etc/hadoop/ssl-server.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>ssl.server.keystore.location</name>
<value>/opt/config/ssl/keystore</value>
</property>
<property>
<name>ssl.server.keystore.password</name>
<value>BigData@2026</value>
</property>
<property>
<name>ssl.server.keystore.keypassword</name>
<value>BigData@2026</value>
</property>
<property>
<name>ssl.server.keystore.type</name>
<value>jks</value>
</property>
<property>
<name>ssl.server.truststore.location</name>
<value>/opt/config/ssl/truststore</value>
</property>
<property>
<name>ssl.server.truststore.password</name>
<value>BigData@2026</value>
</property>
<property>
<name>ssl.server.truststore.type</name>
<value>jks</value>
</property>
</configuration>
EOF
Step 4: create ssl-client.xml:
bash
cat > /opt/module/hadoop/etc/hadoop/ssl-client.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>ssl.client.truststore.location</name>
<value>/opt/config/ssl/truststore</value>
</property>
<property>
<name>ssl.client.truststore.password</name>
<value>BigData@2026</value>
</property>
<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
</property>
</configuration>
EOF
Step 5: distribute the certificates and configuration to each node:
bash
# Distribute each node's own keystore
for node in node2 node3 node4 node5; do
ssh $node "mkdir -p /opt/config/ssl"
scp /opt/config/ssl/${node}/keystore ${node}:/opt/config/ssl/keystore
done
# node1 keystore (if it sits in the node1 subdirectory, copy it to the standard path)
cp /opt/config/ssl/node1/keystore /opt/config/ssl/keystore 2>/dev/null || true
# Distribute the shared truststore and the SSL XML files
for host in node2 node3 node4 node5; do
scp /opt/config/ssl/truststore $host:/opt/config/ssl/
scp /opt/module/hadoop/etc/hadoop/ssl-{server,client}.xml $host:/opt/module/hadoop/etc/hadoop/
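done
Notes:
| Item | Description |
|---|---|
| Certificate SAN | Must contain the node's host name and IP; otherwise host name verification fails on HTTPS access |
| Per-node certificates | Each node has a different certificate, so the nodes cannot share one keystore (each keystore holds a different private key) |
| truststore | Holds the CA root certificate; all nodes can share the same file |
| Certificate expiry | Request a renewal from the company CA before the certificate expires, then re-import it and restart the services |
| Password management | In production, use stronger passwords and protect them with a secrets manager (such as Vault) |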
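For the certificate expiry item, a small date calculation can flag certificates that are close to their end date. The sketch below assumes GNU `date` (as shipped on common Linux distributions) and takes both dates as arguments; the function names and the 30-day default threshold are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: whole days between two dates (requires GNU date).
days_between() {
  local from="$1" to="$2" from_s to_s
  from_s="$(date -u -d "$from" +%s)" || return 1
  to_s="$(date -u -d "$to" +%s)" || return 1
  echo $(( (to_s - from_s) / 86400 ))
}

# Succeeds when the expiry date is at most THRESHOLD days after TODAY.
needs_renewal() {
  local today="$1" expiry="$2" threshold="${3:-30}"
  [ "$(days_between "$today" "$expiry")" -le "$threshold" ]
}
```

The expiry date itself has to be read from the certificate, for example from the validity line that `keytool -list -v` prints for each entry; how that line is parsed depends on the locale, so it is left out of this sketch.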
5.6 yarn-site.xml: YARN resource management configuration
bash
cat > /opt/module/hadoop/etc/hadoop/yarn-site.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- ============================================================
YARN shuffle service (transfers MapReduce intermediate results)
============================================================ -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- ============================================================
ResourceManager addresses (deployed on node2)
============================================================ -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node2</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>node2:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>node2:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>node2:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>node2:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>node2:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.https.address</name>
<value>node2:8090</value>
</property>
<!-- ============================================================
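Log aggregation
When enabled, application logs are collected from the local disks into HDFS so that historical job logs can be viewed in one place
Note: yarn.log.server.url below uses the JobHistory HTTPS port 19890, which is only served when mapreduce.jobhistory.http.policy is HTTPS_ONLY (the default is HTTP_ONLY on port 19888)
============================================================ -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>Enable log aggregation</description>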
</property>
<property>
<name>yarn.log.server.url</name>
<value>https://node1:19890/jobhistory/logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>2592000</value>
<description>Keep aggregated logs for 30 days (2592000 seconds)</description>
</property>
<!-- ============================================================
Memory and CPU limits (sized for small 8 GB nodes)
The NodeManager may use 4 GB; the other 4 GB are reserved for the OS and other services
============================================================ -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
<description>Total memory available to the NodeManager (MB)</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
<description>Number of virtual CPU cores available to the NodeManager</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
<description>Minimum memory per container</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>3072</value>
<description>Maximum memory per container (below the NodeManager total, to leave headroom)</description>
</property>
<!-- ============================================================
YARN Kerberos security settings
============================================================ -->
<property>
<name>yarn.resourcemanager.principal</name>
<value>rm/_HOST@BIGDATA.CLUSTER</value>
</property>
<property>
<name>yarn.resourcemanager.keytab</name>
<value>/opt/config/kerberos/keytabs/rm.service.keytab</value>
</property>
<property>
<name>yarn.nodemanager.principal</name>
<value>nm/_HOST@BIGDATA.CLUSTER</value>
</property>
<property>
<name>yarn.nodemanager.keytab</name>
<value>/opt/config/kerberos/keytabs/nm.service.keytab</value>
</property>
<!-- ============================================================
YARN web authentication (SPNEGO)
============================================================ -->
<property>
<name>yarn.webapp.spnego-principal</name>
<value>HTTP/_HOST@BIGDATA.CLUSTER</value>
</property>
<property>
<name>yarn.webapp.spnego-keytab-file</name>
<value>/opt/config/kerberos/keytabs/HTTP.keytab</value>
</property>
</configuration>
EOF
5.7 mapred-site.xml: MapReduce configuration
bash
cat > /opt/module/hadoop/etc/hadoop/mapred-site.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- ============================================================
Execution framework: YARN
============================================================ -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- ============================================================
MapReduce Application Classpath
============================================================ -->
<property>
<name>mapreduce.application.classpath</name>
<value>
$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,
$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
</value>
</property>
<!-- ============================================================
Administrator list
============================================================ -->
<property>
<name>mapreduce.cluster.administrators</name>
<value>hadoop,yarn,causes</value>
</property>
<!-- ============================================================
JobHistory Server addresses (deployed on node1)
============================================================ -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>node1:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node1:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.https.address</name>
<value>node1:19890</value>
</property>
<!-- ============================================================
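Memory limits for map/reduce tasks (sized for 8 GB nodes)
Java opts are usually set to 80% of memory.mb, leaving 20% for off-heap memory (1024 x 0.8 = 819, 2048 x 0.8 = 1638)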
============================================================ -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx819m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx1638m</value>
</property>
<!-- ============================================================
JobHistory Kerberos security settings
============================================================ -->
<property>
<name>mapreduce.jobhistory.principal</name>
<value>HTTP/_HOST@BIGDATA.CLUSTER</value>
</property>
<property>
<name>mapreduce.jobhistory.keytab</name>
<value>/opt/config/kerberos/keytabs/HTTP.keytab</value>
</property>
</configuration>
EOF
5.8 workers file
Note: the workers file lists the host names of all DataNode and NodeManager nodes, one per line.
bash
cat > /opt/module/hadoop/etc/hadoop/workers << 'EOF'
node3
node4
node5
EOF
5.9 hadoop-env.sh: Hadoop environment variables and JVM options
bash
cat > /opt/module/hadoop/etc/hadoop/hadoop-env.sh << 'EOF'
# ============================================================
# Hadoop environment script
# ============================================================
export JAVA_HOME=/opt/module/jdk8
# ============================================================
# JVM heap limits per component (sized for 8 GB nodes)
# -Xmx in HDFS_NAMENODE_OPTS caps the NameNode heap
# Hadoop 3 uses the HDFS_* names; the older HADOOP_*_OPTS names are deprecated
# ============================================================
export HDFS_NAMENODE_OPTS="-Xmx1024m -Xms1024m $HDFS_NAMENODE_OPTS"
export HDFS_DATANODE_OPTS="-Xmx512m -Xms512m $HDFS_DATANODE_OPTS"
export HDFS_JOURNALNODE_OPTS="-Xmx512m -Xms512m $HDFS_JOURNALNODE_OPTS"
# Memory limits for the YARN daemons
export YARN_RESOURCEMANAGER_OPTS="-Xmx1024m $YARN_RESOURCEMANAGER_OPTS"
export YARN_NODEMANAGER_OPTS="-Xmx1024m $YARN_NODEMANAGER_OPTS"
# ============================================================
# Kerberos-related JVM system properties
# javax.security.auth.useSubjectCredsOnly=false
# lets the GSS-API layer obtain credentials from outside the JAAS Subject (for example the ticket cache)
# ============================================================
export HADOOP_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf \
-Djava.security.krb5.realm=BIGDATA.CLUSTER \
-Djava.security.krb5.kdc=node1:88 \
-Djavax.security.auth.useSubjectCredsOnly=false \
-Dsun.security.krb5.debug=false \
$HADOOP_OPTS"
export HADOOP_CLIENT_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf \
-Djava.security.krb5.realm=BIGDATA.CLUSTER \
$HADOOP_CLIENT_OPTS"
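EOF
5.10 container-executor.cfg (required in secure mode)
Note: in Kerberos secure mode YARN uses the container-executor binary to launch containers as a specific user. The configuration file must be owned by root:group with mode 400 or 640.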
bash
sudo tee /opt/module/hadoop/etc/hadoop/container-executor.cfg << 'EOF'
yarn.nodemanager.local-dirs=/opt/data/hadoop/nm-local-dir
yarn.nodemanager.log-dirs=/opt/data/hadoop/userlogs
# Group used by container-executor when launching containers
yarn.nodemanager.linux-container-executor.group=causes
# Containers must not run as these system users
banned.users=hdfs,yarn,mapred,bin
# Lowest user ID allowed to run containers (blocks root and system accounts)
min.user.id=1000
EOF
# Set ownership and permissions (the container-executor binary must be mode 6050: setuid + setgid)
sudo chown root:causes /opt/module/hadoop/etc/hadoop/container-executor.cfg
sudo chown root:causes /opt/module/hadoop/bin/container-executor
sudo chmod 6050 /opt/module/hadoop/bin/container-executor
sudo chmod 640 /opt/module/hadoop/etc/hadoop/container-executor.cfg
5.11 Distribute the Hadoop configuration to all nodes
bash
# Distribute the XML configuration files and core scripts
for host in node2 node3 node4 node5; do
echo "========== Distributing Hadoop configuration to $host =========="
# Copy all XML configuration files
scp /opt/module/hadoop/etc/hadoop/*.xml $host:/opt/module/hadoop/etc/hadoop/
# Copy the workers file and the environment script
scp /opt/module/hadoop/etc/hadoop/{workers,hadoop-env.sh} $host:/opt/module/hadoop/etc/hadoop/
done
# Distribute container-executor.cfg to the NodeManager nodes (node3/4/5)
# The copy arrives owned by the SSH user, so its ownership is reset to root:causes on the target
for host in node3 node4 node5; do
scp /opt/module/hadoop/etc/hadoop/container-executor.cfg $host:/opt/module/hadoop/etc/hadoop/
ssh $host "sudo chown root:causes /opt/module/hadoop/etc/hadoop/container-executor.cfg /opt/module/hadoop/bin/container-executor && sudo chmod 6050 /opt/module/hadoop/bin/container-executor && sudo chmod 640 /opt/module/hadoop/etc/hadoop/container-executor.cfg"
done
# Create the data directories on each node
for host in node3 node4 node5; do
ssh $host "mkdir -p /opt/data/hadoop/{tmp,name,data,journal,nm-local-dir,userlogs} && chown -R causes:causes /opt/data/hadoop"
done
# node1 and node2 do not need the data and journal directories
ssh node1 "mkdir -p /opt/data/hadoop/{tmp,name} && chown -R causes:causes /opt/data/hadoop"
ssh node2 "mkdir -p /opt/data/hadoop/{tmp,name} && chown -R causes:causes /opt/data/hadoop"
5.12 Hadoop initialization procedure
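Important: the order of the following steps must not be changed. The JournalNodes must be running before the NameNode is formatted; otherwise the format cannot initialize the shared EditLog.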
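The ordering requirement can be enforced with a small guard that checks the JournalNode port before formatting. This is a minimal sketch assuming the JournalNode RPC port 8485 used in this guide; `port_open`, `journalnodes_ready`, and the `PROBE` override are hypothetical names introduced only for illustration.

```bash
#!/usr/bin/env bash
# Default probe: succeeds when a TCP connection to host:port opens within 2 seconds.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Returns 0 only when every listed JournalNode answers on port 8485.
# PROBE is a hypothetical hook that lets the logic be exercised without a cluster.
journalnodes_ready() {
  local probe="${PROBE:-port_open}" host
  for host in "$@"; do
    if ! "$probe" "$host" 8485; then
      echo "JournalNode not reachable: $host" >&2
      return 1
    fi
  done
  return 0
}
```

A possible use is `journalnodes_ready node3 node4 node5 && hdfs namenode -format`, so the format only runs once all three JournalNodes accept connections.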
Step 1: start the JournalNodes (node3/4/5)
bash
# Start the JournalNode on each of the three nodes
for host in node3 node4 node5; do
ssh $host "source /etc/profile.d/custom_profile.sh && $HADOOP_HOME/bin/hdfs --daemon start journalnode"
done
# Verify the JournalNode processes
for host in node3 node4 node5; do
echo "=== $host ==="
ssh $host "jps | grep JournalNode"
done
# Expected: each node prints one JournalNode process with its PID
Step 2: format the Active NameNode (node1)
bash
# Obtain a Kerberos ticket for the HDFS superuser
kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs
# Format the NameNode (first deployment only)
hdfs namenode -format
Expected output (last few lines):
INFO common.Storage: Storage directory /opt/data/hadoop/name has been successfully formatted.
INFO namenode.FSImage: Allocated new BlockPoolId: BP-1692049074-192.168.164.131-1718700000000
INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node1/192.168.164.131
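************************************************************/
Key output: the BlockPoolId (BP-xxxxxxxxx) uniquely identifies this format operation. Reformatting produces a new BlockPoolId and a new clusterID, so DataNodes have to register again, and data directories written under the old IDs no longer match.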
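The IDs created by the format are stored in the VERSION file under the NameNode metadata directory, and can be read back for comparison. A minimal sketch, assuming the usual key=value layout of that file; the function name `version_field` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the value of KEY from a Hadoop storage VERSION file.
version_field() {
  local file="$1" key="$2"
  # Print only the part after "KEY=" on the matching line
  sed -n "s/^${key}=//p" "$file"
}
```

For example, `version_field /opt/data/hadoop/name/current/VERSION blockpoolID` on node1 prints the BlockPoolId, and comparing `clusterID` between that file and `/opt/data/hadoop/data/current/VERSION` on a DataNode shows whether the DataNode still belongs to the current namespace.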
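Step 3: bootstrap the Standby NameNode (node2)
There are two ways to do this; Method B is recommended:
Method A: copy the metadata directory directly
bash
# On node1: copy the freshly formatted metadata to node2
scp -r /opt/data/hadoop/name/current node2:/opt/data/hadoop/name/
Method B (recommended): use the bootstrap command
bash
# -bootstrapStandby downloads the latest fsimage from the running NameNode on node1,
# so start that NameNode first (a later start attempt on node1 only reports that it is already running)
hdfs --daemon start namenode
# Run on node2
ssh node2 "source /etc/profile.d/custom_profile.sh && kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs && hdfs namenode -bootstrapStandby"
Expected output: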
=====================================================
About to bootstrap Standby NameNode
This will overwrite any existing data in /opt/data/hadoop/name
=====================================================
......
=====================================================
Standby NameNode bootstrap completed successfully.
=====================================================
Step 4: format ZKFC (initialize the HA state in ZooKeeper)
bash
# Run on node1; creates the /hadoop-ha/bigdata-cluster znode
hdfs zkfc -formatZK
Expected output:
Successfully created /hadoop-ha/bigdata-cluster in ZK.
Verify the HA znode in ZooKeeper:
bash
/opt/module/zookeeper/bin/zkCli.sh -server node3:2181 << 'EOF'
ls /hadoop-ha
ls /hadoop-ha/bigdata-cluster
quit
EOF
Step 5: start the NameNodes and ZKFC
bash
# node1: start the NameNode (skip if it is already running) and ZKFC
hdfs --daemon start namenode
hdfs --daemon start zkfc
# node2: start the NameNode (Standby) and ZKFC
ssh node2 "source /etc/profile.d/custom_profile.sh && kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs && $HADOOP_HOME/bin/hdfs --daemon start namenode && $HADOOP_HOME/bin/hdfs --daemon start zkfc"
Step 6: start the DataNodes (node3/4/5)
bash
for host in node3 node4 node5; do
ssh $host "source /etc/profile.d/custom_profile.sh && kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs && $HADOOP_HOME/bin/hdfs --daemon start datanode"
done
Step 7: verify the HA state
bash
# Show the active/standby state of both NameNodes
hdfs haadmin -getAllServiceState
Expected output:
node1:8020 active
node2:8020 standby
Step 8: start YARN
bash
# node2: start the ResourceManager
ssh node2 "source /etc/profile.d/custom_profile.sh && kinit -kt /opt/config/kerberos/keytabs/yarn.keytab yarn && $HADOOP_HOME/bin/yarn --daemon start resourcemanager"
# node3/4/5: start the NodeManagers
for host in node3 node4 node5; do
ssh $host "source /etc/profile.d/custom_profile.sh && kinit -kt /opt/config/kerberos/keytabs/yarn.keytab yarn && $HADOOP_HOME/bin/yarn --daemon start nodemanager"
done
Step 9: start the JobHistory Server (node1)
bash
mapred --daemon start historyserver
5.13 Overall verification
Process check:
bash
for host in node1 node2 node3 node4 node5; do
echo "========== $host =========="
ssh $host "source /etc/profile.d/custom_profile.sh && jps | grep -v Jps | sort"
done
Expected processes per node (jps lists ZKFC under the name DFSZKFailoverController):
| Node | Expected processes |
|---|---|
| node1 | NameNode, DFSZKFailoverController, JobHistoryServer |
| node2 | NameNode, DFSZKFailoverController, ResourceManager |
| node3 | DataNode, JournalNode, NodeManager |
| node4 | DataNode, JournalNode, NodeManager |
| node5 | DataNode, JournalNode, NodeManager |
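The comparison against the expected process list can be automated. The sketch below reads `jps` output from standard input and prints every expected name that is missing; the function name `missing_processes` is hypothetical, and it relies on jps printing `PID Name` per line, with the ZKFC daemon shown as DFSZKFailoverController.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints each expected process name that is absent from jps output.
# $1: space-separated expected names; stdin: output of `jps`
missing_processes() {
  local expected="$1" running name
  running="$(awk '{print $2}')"      # second column of jps output is the process name
  for name in $expected; do
    if ! grep -qx "$name" <<<"$running"; then
      echo "$name"
    fi
  done
}
```

A possible call for node1 is `ssh node1 jps | missing_processes "NameNode DFSZKFailoverController JobHistoryServer"`; empty output means every expected process was found.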
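Web UI verification:
| Component | Address | Notes |
|---|---|---|
| NameNode Active | https://node1:9871 | The browser must be configured for Kerberos (SPNEGO) |
| NameNode Standby | https://node2:9871 | The browser must be configured for Kerberos (SPNEGO) |
| YARN | http://node2:8088 | Resource scheduling UI |
HDFS read/write test: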
bash
# Authenticate
kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs
# Create the directory (-p avoids an error if it already exists)
hdfs dfs -mkdir -p /tmp
hdfs dfs -chmod 1777 /tmp
# Upload a test file
echo "Hadoop HA with Kerberos is working" > /tmp/hadoop_test.txt
hdfs dfs -put /tmp/hadoop_test.txt /tmp/
hdfs dfs -cat /tmp/hadoop_test.txt
# Expected: Hadoop HA with Kerberos is working
# List the file system root
hdfs dfs -ls /
# Expected: the /tmp directory is listed
# Clean up
hdfs dfs -rm /tmp/hadoop_test.txt
Manual HA failover test:
bash
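# Current state
hdfs haadmin -getAllServiceState
# Manual failover (nn1 -> nn2)
hdfs haadmin -failover nn1 nn2
# Check the state again (nn2 is expected to be active)
hdfs haadmin -getAllServiceState
# Switch back
hdfs haadmin -failover nn2 nn1
6. Hive Metastore
6.1 Hive Metastore architecture
Note: the Hive Metastore is Hive's metadata service; it stores table schemas, partition information, column statistics, SerDe definitions, and other metadata. In this setup the Metastore service runs on node2 and uses the MySQL instance on node1 as its backing database.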
┌─────────┐     Thrift      ┌─────────────────┐    JDBC     ┌─────────┐
│  Hive   │ ──────────────> │ Hive Metastore  │ ──────────> │  MySQL  │
│ Client  │    port 9083    │     (node2)     │    3306     │ (node1) │
└─────────┘                 └─────────────────┘             └─────────┘
6.2 Confirm the installation and add the MySQL driver
bash
ls /opt/module/hive/
# If the archive has not been unpacked yet:
# cd /opt/software/
# tar -xzf apache-hive-3.1.3-bin.tar.gz -C /opt/module/
# mv /opt/module/apache-hive-3.1.3-bin /opt/module/hive
# Copy the MySQL JDBC driver into the Hive lib directory
# The driver file is usually named mysql-connector-j-8.x.x.jar
cp /opt/software/mysql-connector-j-*.jar /opt/module/hive/lib/
# Remove older MySQL drivers that could conflict
rm -f /opt/module/hive/lib/mysql-connector-java-5*.jar
# Verify that the driver is in place
ls /opt/module/hive/lib/mysql-connector-*
6.3 hive-site.xml
Create the Hive configuration file on node2:
bash
cat > /opt/module/hive/conf/hive-site.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- ============================================================
Metastore database connection (MySQL on node1)
createDatabaseIfNotExist=true: the metastore database is created automatically if it is missing
allowPublicKeyRetrieval=true: needed for caching_sha2_password in MySQL 8.0
Inside XML, every ampersand in the JDBC URL must be escaped as an entity
============================================================ -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://node1:3306/hive_metastore?createDatabaseIfNotExist=true&amp;useSSL=false&amp;allowPublicKeyRetrieval=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>Hive@2026!</value>
</property>
<!-- ============================================================
Data warehouse directory on HDFS
============================================================ -->
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/warehouse</value>
<description>Default HDFS storage directory for Hive tables</description>
</property>
<!-- ============================================================
Metastore Thrift service address
============================================================ -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://node2:9083</value>
<description>Metastore Thrift 服务地址 (客户端连接此地址)</description>
</property>
<!-- ============================================================
Schema 验证关闭 (避免版本不一致警告)
============================================================ -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
<!-- ============================================================
Kerberos authentication
============================================================ -->
<property>
<name>hive.metastore.sasl.enabled</name>
<value>true</value>
<description>Enable SASL/GSSAPI authentication for the Metastore</description>
</property>
<property>
<name>hive.metastore.kerberos.principal</name>
<value>hive/_HOST@BIGDATA.CLUSTER</value>
</property>
<property>
<name>hive.metastore.kerberos.keytab.file</name>
<value>/opt/config/kerberos/keytabs/hive.service.keytab</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>KERBEROS</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>hive/_HOST@BIGDATA.CLUSTER</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.keytab</name>
<value>/opt/config/kerberos/keytabs/hive.service.keytab</value>
</property>
<!-- ============================================================
Scratch directories (hive.exec.scratchdir is on HDFS; the other two are local)
============================================================ -->
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
</property>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/tmp/hive</value>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/tmp/hive/resources</value>
</property>
<!-- ============================================================
HiveServer2 bind address and port
============================================================ -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>node2</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
</configuration>
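EOF
6.5 Kerberos Authentication Notes
Key settings:
- hive.metastore.execute.setugi=false: the Metastore operates on HDFS with its own Kerberos identity instead of impersonating the client. Setting it to false avoids the Client cannot authenticate via:[TOKEN, KERBEROS] error that occurs when client credentials are not passed through. This property is not part of the hive-site.xml shown in 6.3, so add it there if you rely on it.
- hive.metastore.kerberos.principal=hive/_HOST@BIGDATA.CLUSTER: the Metastore's Kerberos principal.
- hive.metastore.kerberos.keytab.file: points to the keytab that contains the hive/<hostname> principal.
Starting the Metastore: the Metastore process needs a valid Kerberos ticket to operate on HDFS. Obtain one from the keytab before starting: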
bash
kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs/node2@BIGDATA.CLUSTER
nohup bash -c "export KRB5CCNAME=FILE:/tmp/krb5cc_1000 JAVA_HOME=/opt/module/jdk8 HADOOP_HOME=/opt/module/hadoop; /opt/module/hive/bin/hive --service metastore" > /opt/data/hadoop/logs/metastore.log 2>&1 &
Troubleshooting common errors:
| Error message | Cause | Fix |
|---|---|---|
| Client cannot authenticate via:[TOKEN, KERBEROS] | The Metastore has no Kerberos ticket, or setugi=true but the client passed no credentials | Make sure kinit succeeded and set setugi=false |
| Metastore hangs after the SLF4J lines | It is waiting for HDFS Kerberos authentication to time out | Check that the KRB5CCNAME environment variable reaches the JVM |
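Before launching the Metastore it can help to confirm that the ticket cache named by KRB5CCNAME actually exists. The following is a minimal sketch under the assumption that a FILE: cache is used, as in the start command above; `ccache_path` and `ccache_ready` are hypothetical helper names, and the check only tests for a non-empty file, not for ticket validity.

```shell
#!/usr/bin/env bash
# Resolve a KRB5CCNAME value such as FILE:/tmp/krb5cc_1000 to a plain path.
# Only FILE: caches and bare absolute paths are handled; other types fail.
ccache_path() {
  local spec="$1"
  case "$spec" in
    FILE:*) printf '%s\n' "${spec#FILE:}" ;;
    /*)     printf '%s\n' "$spec" ;;
    *)      return 1 ;;
  esac
}

# Succeeds only if the cache file exists and is non-empty.
ccache_ready() {
  local path
  path="$(ccache_path "$1")" || return 1
  [ -s "$path" ]
}

# Example usage with the cache path from the start command
if ccache_ready "FILE:/tmp/krb5cc_1000"; then
  echo "ticket cache present"
else
  echo "ticket cache missing: run kinit first"
fi
```

A stricter check would call `klist -s`, which also verifies that the tickets have not expired.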
6.6 Initialize the Hive Metastore Schema
bash
# Run on node2 (Kerberos authentication required)
ssh node2 "source /etc/profile.d/custom_profile.sh && cd /opt/module/hive && kinit -kt /opt/config/kerberos/keytabs/hive.keytab hive"
# Initialize the Metastore schema in MySQL (first deployment only)
ssh node2 "source /etc/profile.d/custom_profile.sh && cd /opt/module/hive && schematool -dbType mysql -initSchema"
Expected output:
Initialization script completed
schemaTool completed
Verify that the schema was created:
bash
docker exec mysql8 mysql -u root -pRoot@2026! -e "USE hive_metastore; SHOW TABLES;"
# Expected: 60+ tables, including VERSION, DBS, TBLS, PARTITIONS, COLUMNS_V2, and SDS
6.7 Start the Hive Metastore Service
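A fixed `sleep` can be too short on a slow VM. As an alternative, the sketch below polls a TCP port until it accepts connections. `wait_for_port` is a hypothetical helper introduced here; it relies on bash's built-in /dev/tcp redirection, which is assumed to be enabled in the local bash build.

```shell
#!/usr/bin/env bash
# Poll a TCP port until it accepts connections or the timeout (seconds) expires.
# Returns 0 once a connection succeeds, 1 on timeout.
wait_for_port() {
  local host="$1" port="$2" timeout_s="${3:-30}"
  local deadline=$(( SECONDS + timeout_s ))
  while (( SECONDS < deadline )); do
    # The subshell opens and immediately drops the connection.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example (not executed here):
# wait_for_port node2 9083 60 && echo "Metastore is listening"
```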
bash
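# On node2
ssh node2 "source /etc/profile.d/custom_profile.sh && kinit -kt /opt/config/kerberos/keytabs/hive.service.keytab hive/node2@BIGDATA.CLUSTER && nohup $HIVE_HOME/bin/hive --service metastore > /opt/data/hive/metastore.log 2>&1 &"
# Wait a few seconds for the JVM to come up
sleep 5
# Save the PID. $! is empty in a fresh SSH session, so look the process up by its main class
# (the [.] keeps pgrep from matching the remote shell that runs this command)
ssh node2 "pgrep -f 'metastore[.]HiveMetaStore' | head -1 > /opt/data/hive/metastore.pid"
# Verify the listener
ssh node2 "netstat -tlnp 2>/dev/null | grep 9083"
# Expected: 0.0.0.0:9083 in the LISTEN state
Beeline connection test (from node1):
bash
kinit -kt /opt/config/kerberos/keytabs/hive.keytab hive
# Port 10000 is HiveServer2, which must already be running on node2; the steps above start only the Metastore
/opt/module/hive/bin/beeline -u "jdbc:hive2://node2:10000/default;principal=hive/node2@BIGDATA.CLUSTER" -e "SHOW DATABASES;"
# Expected: the default database is listed
7. Kafka Cluster (KRaft)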
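7.1 KRaft Mode Overview
Note: Kafka 4.3.0 natively supports KRaft (Kafka Raft Consensus) mode, which moves metadata management from an external ZooKeeper ensemble into Kafka itself. In KRaft mode some brokers also act as controllers, and the controllers elect a leader through the Raft protocol.
Cluster configuration:
- Three-node combined mode (every node is both broker and controller)
- Broker ID: node3=3, node4=4, node5=5
- Controller quorum voters: 3@node3:9093, 4@node4:9093, 5@node5:9093
- Protocol: SASL_PLAINTEXT (Kerberos authentication, no encryption)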
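The voter list above follows the `id@host:port` pattern, and a Raft quorum needs a majority of voters to stay available. The sketch below builds the string and computes the majority size; `build_voters` and `quorum_size` are hypothetical helper names used only for illustration.

```shell
#!/usr/bin/env bash
# Build a controller.quorum.voters value from "id:host" pairs and a port.
build_voters() {
  local port="$1"; shift
  local out="" pair id host
  for pair in "$@"; do
    id="${pair%%:*}"     # text before the first colon
    host="${pair#*:}"    # text after the first colon
    out+="${out:+,}${id}@${host}:${port}"
  done
  printf '%s\n' "$out"
}

# Number of voters that must be alive for the quorum to make progress.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

build_voters 9093 3:node3 4:node4 5:node5
quorum_size 3
```

With three voters the majority is two, so the cluster tolerates the loss of one controller.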
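KRaft vs ZooKeeper:
| Feature | ZooKeeper mode | KRaft mode |
|---|---|---|
| External dependency | Requires a ZK ensemble | None |
| Metadata storage | ZooKeeper | Internal Raft log |
| Scalability | Limited by ZK (100,000+ partitions) | Millions of partitions |
| Operational complexity | High (two systems) | Low (one system) |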
7.2 确认 Kafka 安装
bash
ls /opt/module/kafka/
# If not yet extracted:
# cd /opt/software/
# tar -xzf kafka_2.13-4.3.0.tgz -C /opt/module/
# mv /opt/module/kafka_2.13-4.3.0 /opt/module/kafka
7.3 Generate the Cluster UUID
bash
# Generate one unique UUID on node1; every node shares it
# (kafka-storage.sh wraps StorageTool and avoids the unquoted classpath glob)
KAFKA_CLUSTER_ID=$(/opt/module/kafka/bin/kafka-storage.sh random-uuid)
echo "KAFKA_CLUSTER_ID=$KAFKA_CLUSTER_ID"
# Save it to an environment file
mkdir -p /opt/config/kafka
echo "KAFKA_CLUSTER_ID=$KAFKA_CLUSTER_ID" > /opt/config/kafka/cluster_id.env
cat /opt/config/kafka/cluster_id.env
Expected output (illustrative UUID):
KAFKA_CLUSTER_ID=aB3dEfGh-IjKl-MnOp-QrSt-UvWxYzAbCdEf
7.4 server.properties Configuration
First create the template configuration on node3:
bash
cat > /opt/module/kafka/config/server.properties << 'EOF'
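# ============================================================
# Kafka 4.3.0 KRaft configuration template (node3)
# ============================================================
# --- Roles: combined mode, each node is both broker and controller ---
process.roles=broker,controller
# --- Node identity (unique per node) ---
# node.id is the KRaft-native name; broker.id is kept because the sed commands in 7.6 match this line
broker.id=3
# --- Listeners and protocols ---
# SASL_PLAINTEXT: Kerberos authentication without encryption (trusted internal network)
# The CONTROLLER listener on 9093 must match the port used in controller.quorum.voters
listeners=SASL_PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
advertised.listeners=SASL_PLAINTEXT://node3:9092
# --- Controller communication ---
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:SASL_PLAINTEXT,SASL_PLAINTEXT:SASL_PLAINTEXT
# --- KRaft controller quorum voters ---
# Format: node_id@hostname:controller_port
controller.quorum.voters=3@node3:9093,4@node4:9093,5@node5:9093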
# --- Data storage ---
log.dirs=/opt/data/kafka/data
# --- Default topic settings ---
num.partitions=3
default.replication.factor=2
min.insync.replicas=1
# --- Log retention policy ---
# Keep 7 days (168 hours) or 1 GiB per partition, whichever limit is reached first
log.retention.hours=168
log.retention.bytes=1073741824
log.segment.bytes=268435456
log.retention.check.interval.ms=300000
# --- No zookeeper.connect entry: KRaft mode does not use ZooKeeper ---
# ============================================================
# Kerberos SASL authentication
# ============================================================
sasl.enabled.mechanisms=GSSAPI
sasl.mechanism.inter.broker.protocol=GSSAPI
sasl.kerberos.service.name=kafka
security.inter.broker.protocol=SASL_PLAINTEXT
# Super users (bypass ACLs; used for administrative operations)
super.users=User:kafka;User:causes
# ============================================================
# Replication settings for internal topics
# ============================================================
offsets.topic.replication.factor=2
transaction.state.log.replication.factor=2
transaction.state.log.min.isr=1
# ============================================================
# Network and I/O threads
# ============================================================
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
num.recovery.threads.per.data.dir=1
group.initial.rebalance.delay.ms=0
EOF
7.5 JAAS Authentication Configuration Files
The JAAS (Java Authentication and Authorization Service) files configure the Kerberos login module:
Kafka server JAAS (kafka_server_jaas.conf):
bash
cat > /opt/module/kafka/config/kafka_server_jaas.conf << 'EOF'
// ============================================================
// Kafka broker (server-side) JAAS configuration
// JAAS does not expand _HOST, so the principal must carry the real hostname.
// This template is for node3; every other node needs its own hostname here.
// ============================================================
KafkaServer {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/opt/config/kerberos/keytabs/kafka.service.keytab"
principal="kafka/node3@BIGDATA.CLUSTER";
};
// The Client section is the JAAS context Kafka used for ZooKeeper connections; KRaft mode does not read it
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/opt/config/kerberos/keytabs/kafka.keytab"
principal="kafka@BIGDATA.CLUSTER";
};
KafkaController {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/opt/config/kerberos/keytabs/kafka.service.keytab"
principal="kafka/node3@BIGDATA.CLUSTER";
};
EOF
Kafka client JAAS (kafka_client_jaas.conf):
bash
cat > /opt/module/kafka/config/kafka_client_jaas.conf << 'EOF'
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
keyTab="/opt/config/kerberos/keytabs/kafka.keytab"
principal="kafka@BIGDATA.CLUSTER";
};
EOF
7.6 Distribute Kafka and Adjust Per-Node Configuration
bash
# Load the UUID
source /opt/config/kafka/cluster_id.env
# Copy the whole Kafka directory to node4 and node5
for host in node4 node5; do
  scp -r /opt/module/kafka $host:/opt/module/
  ssh $host "mkdir -p /opt/config/kafka"
  scp /opt/config/kafka/cluster_id.env $host:/opt/config/kafka/
done
# node4: change broker.id and advertised.listeners
ssh node4 "sed -i 's/broker.id=3/broker.id=4/' /opt/module/kafka/config/server.properties"
ssh node4 "sed -i 's/SASL_PLAINTEXT:\/\/node3:9092/SASL_PLAINTEXT:\/\/node4:9092/' /opt/module/kafka/config/server.properties"
# node5: same changes
ssh node5 "sed -i 's/broker.id=3/broker.id=5/' /opt/module/kafka/config/server.properties"
ssh node5 "sed -i 's/SASL_PLAINTEXT:\/\/node3:9092/SASL_PLAINTEXT:\/\/node5:9092/' /opt/module/kafka/config/server.properties"
# JAAS does not expand _HOST: write each node's own hostname into the server principal
for host in node3 node4 node5; do
  ssh $host "sed -i -E 's#kafka/(_HOST|node3)@#kafka/$host@#' /opt/module/kafka/config/kafka_server_jaas.conf"
done
# Create the data directories
for host in node3 node4 node5; do
  ssh $host "mkdir -p /opt/data/kafka/data && chown -R causes:causes /opt/data/kafka"
done
7.7 Format the Kafka Storage Directories
Note: in KRaft mode the storage directory must be formatted with kafka-storage.sh format before the first start. This generates the metadata log files.
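Formatting is a first-start step, so a guard that skips already-formatted directories makes the procedure safer to re-run. A formatted Kafka log directory contains a meta.properties file; the sketch below checks for it. `needs_format` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Succeeds (exit 0) when the log directory has not been formatted yet,
# that is, when it contains no meta.properties file.
needs_format() {
  local log_dir="$1"
  [ ! -f "${log_dir}/meta.properties" ]
}

# Example (not executed here), using the log.dirs value from server.properties:
# if needs_format /opt/data/kafka/data; then
#   /opt/module/kafka/bin/kafka-storage.sh format \
#     --config /opt/module/kafka/config/server.properties \
#     --cluster-id "$KAFKA_CLUSTER_ID"
# fi
```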
bash
source /opt/config/kafka/cluster_id.env
# Format the storage directory on each node with the shared cluster ID
for host in node3 node4 node5; do
  ssh $host "source /etc/profile.d/custom_profile.sh && kinit -kt /opt/config/kerberos/keytabs/kafka.service.keytab kafka/$host@BIGDATA.CLUSTER && /opt/module/kafka/bin/kafka-storage.sh format --config /opt/module/kafka/config/server.properties --cluster-id $KAFKA_CLUSTER_ID"
done
Expected output on each node (the metadata.version shown depends on the Kafka release):
Formatting /opt/data/kafka/data with metadata.version 3.8-IV0
7.8 Start Kafka
bash
# Start the broker on node3, node4, and node5
for host in node3 node4 node5; do
  ssh $host "export KAFKA_OPTS='-Djava.security.auth.login.config=/opt/module/kafka/config/kafka_server_jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf' && source /etc/profile.d/custom_profile.sh && nohup /opt/module/kafka/bin/kafka-server-start.sh /opt/module/kafka/config/server.properties > /opt/data/kafka/kafka.log 2>&1 &"
done
# Wait for startup to finish, then verify the processes
sleep 10
for host in node3 node4 node5; do
  echo "=== $host ==="
  ssh $host "jps | grep -i kafka"
done
7.9 Topic Test
Create the client SASL configuration file:
bash
cat > /opt/module/kafka/config/client.properties << 'EOF'
security.protocol=SASL_PLAINTEXT
sasl.mechanism=GSSAPI
sasl.kerberos.service.name=kafka
EOF
Test topic creation, producing, and consuming:
bash
# Obtain a Kerberos ticket
kinit -kt /opt/config/kerberos/keytabs/kafka.keytab kafka
# Create a test topic (3 partitions, 2 replicas)
kafka-topics.sh --create \
  --bootstrap-server node3:9092,node4:9092,node5:9092 \
  --topic test_topic \
  --partitions 3 \
  --replication-factor 2 \
  --command-config /opt/module/kafka/config/client.properties
# Expected: Created topic test_topic.
# List topics
kafka-topics.sh --list \
  --bootstrap-server node3:9092,node4:9092,node5:9092 \
  --command-config /opt/module/kafka/config/client.properties
# Describe the topic (partition layout, leader, ISR)
kafka-topics.sh --describe \
  --bootstrap-server node3:9092 \
  --topic test_topic \
  --command-config /opt/module/kafka/config/client.properties
# Produce one message (piped in, so the producer exits on its own)
echo "Hello Kafka KRaft with Kerberos" | kafka-console-producer.sh \
  --bootstrap-server node3:9092,node4:9092,node5:9092 \
  --topic test_topic \
  --producer.config /opt/module/kafka/config/client.properties
# Consume from the beginning; --max-messages 1 makes the consumer exit after the test message
kafka-console-consumer.sh \
  --bootstrap-server node3:9092,node4:9092,node5:9092 \
  --topic test_topic \
  --from-beginning \
  --max-messages 1 \
  --consumer.config /opt/module/kafka/config/client.properties
# Expected output: Hello Kafka KRaft with Kerberos
# Clean up the test topic
kafka-topics.sh --delete \
  --bootstrap-server node3:9092,node4:9092,node5:9092 \
  --topic test_topic \
  --command-config /opt/module/kafka/config/client.properties
8. Spark Configuration
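8.1 About Spark
Spark 4.1.2 is configured to run on YARN: after a client submits a job, YARN allocates containers to execute it. The setup integrates Kerberos authentication and accesses the Hive Metastore.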
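When sizing executors for small nodes, note that YARN is asked for more than the executor heap: by default Spark adds a memory overhead of 10% of the executor memory with a floor of 384 MiB. The sketch below reproduces that arithmetic; `executor_container_mb` is a hypothetical helper, and it ignores YARN's rounding up to its minimum allocation increment.

```shell
#!/usr/bin/env bash
# Estimate the YARN container request (MiB) for one Spark executor:
# executor memory + max(10% of executor memory, 384 MiB).
executor_container_mb() {
  local mem_mb="$1"
  local overhead=$(( mem_mb * 10 / 100 ))
  if (( overhead < 384 )); then
    overhead=384
  fi
  echo $(( mem_mb + overhead ))
}

# A 2g executor, the default used in this guide
executor_container_mb 2048
```

For a 2 GiB executor the request is 2432 MiB, so two executors plus the application master should fit within what the NodeManagers advertise.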
8.2 spark-env.sh
bash
# Create from the template
cp /opt/module/spark/conf/spark-env.sh.template /opt/module/spark/conf/spark-env.sh
# Append the environment settings
cat >> /opt/module/spark/conf/spark-env.sh << 'EOF'
# ============================================================
# JDK and Hadoop configuration paths
# Note: Spark 4.x requires Java 17 or later; point JAVA_HOME at /opt/module/jdk17 when running Spark 4.x
# ============================================================
export JAVA_HOME=/opt/module/jdk8
export HADOOP_CONF_DIR=/opt/module/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/module/hadoop/etc/hadoop
# ============================================================
# Kerberos
# ============================================================
export SPARK_SUBMIT_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf $SPARK_SUBMIT_OPTS"
# ============================================================
# Hive integration
# ============================================================
export HIVE_CONF_DIR=/opt/module/hive/conf
# ============================================================
# Default memory (sized for 8 GB nodes)
# ============================================================
export SPARK_DRIVER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=2g
EOF
8.3 spark-defaults.conf
bash
cat > /opt/module/spark/conf/spark-defaults.conf << 'EOF'
# ============================================================
# Spark 4.1.2 defaults (YARN + Kerberos + Hive)
# ============================================================
# --- Master and deploy mode ---
spark.master yarn
spark.submit.deployMode client
# --- Kerberos authentication ---
spark.yarn.principal spark@BIGDATA.CLUSTER
spark.yarn.keytab /opt/config/kerberos/keytabs/spark.keytab
# --- YARN resource allocation (sized for 8 GB nodes) ---
spark.yarn.am.cores 1
spark.yarn.am.memory 512m
spark.executor.cores 1
spark.executor.memory 2g
spark.executor.instances 2
spark.driver.memory 1g
spark.driver.cores 1
# Disable dynamic allocation (avoids resource fragmentation on a small cluster)
spark.dynamicAllocation.enabled false
# --- Hive Metastore integration ---
spark.sql.catalogImplementation hive
spark.sql.warehouse.dir /warehouse
spark.sql.hive.metastore.version 3.1.3
spark.sql.hive.metastore.jars path
spark.sql.hive.metastore.jars.path /opt/module/hive/lib/*
# --- Serialization ---
spark.serializer org.apache.spark.serializer.KryoSerializer
# --- Shuffle ---
spark.sql.shuffle.partitions 8
# --- Security ---
spark.authenticate true
spark.authenticate.seed true
# --- History event log (optional; uncomment when needed) ---
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://bigdata-cluster/spark-history
# spark.yarn.historyServer.address node1:18080
EOF
8.4 Verify Spark
bash
# Obtain a Kerberos ticket
kinit -kt /opt/config/kerberos/keytabs/spark.keytab spark
# Run the Pi estimation job (YARN cluster mode)
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --num-executors 2 \
  --executor-memory 1g \
  /opt/module/spark/examples/jars/spark-examples_2.13-*.jar 10
# In cluster mode the result is written to the driver log; fetch it through yarn (replace <app_id>):
# yarn logs -applicationId <app_id>
# Expected output: Pi is roughly 3.14...
9. Flink Configuration
9.1 About Flink
Flink 2.2.1 is configured for YARN Application mode and integrates Kerberos authentication, an HDFS-backed state store, and ZooKeeper HA.
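For capacity planning on YARN it is useful to estimate how many TaskManager containers a job will request. Under the assumption of default slot sharing, a job needs as many slots as its highest operator parallelism, so the TaskManager count is the parallelism divided by the slots per TaskManager, rounded up. `taskmanagers_needed` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Number of TaskManagers = ceil(parallelism / slots per TaskManager).
taskmanagers_needed() {
  local parallelism="$1" slots="$2"
  echo $(( (parallelism + slots - 1) / slots ))
}

# parallelism.default=2 and taskmanager.numberOfTaskSlots=2, as configured in 9.2
taskmanagers_needed 2 2
```

With the values from this guide a job at default parallelism fits into a single 2048m TaskManager plus the 1024m JobManager.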
9.2 flink-conf.yaml
bash
cat > /opt/module/flink/conf/flink-conf.yaml << 'EOF'
# ============================================================
# Flink 2.2.1 configuration (YARN Application + Kerberos + HA)
# Note: Flink 2.x reads conf/config.yaml; if your distribution ships that file, put these settings there
# ============================================================
# --- JobManager ---
jobmanager.rpc.address: node1
jobmanager.rpc.port: 6123
jobmanager.bind-host: 0.0.0.0
jobmanager.memory.process.size: 1024m
# --- TaskManager ---
taskmanager.bind-host: 0.0.0.0
taskmanager.memory.process.size: 2048m
taskmanager.numberOfTaskSlots: 2
# --- Default parallelism ---
parallelism.default: 2
# --- REST API / Web UI ---
rest.address: 0.0.0.0
rest.bind-address: 0.0.0.0
rest.port: 8081
# --- Blob server ---
blob.server.port: 6124
# --- State backend and checkpoints ---
state.backend: hashmap
state.checkpoints.dir: hdfs://bigdata-cluster/flink/checkpoints
state.savepoints.dir: hdfs://bigdata-cluster/flink/savepoints
# --- ZooKeeper HA (high availability) ---
high-availability: zookeeper
high-availability.storageDir: hdfs://bigdata-cluster/flink/ha
high-availability.zookeeper.quorum: node3:2181,node4:2181,node5:2181
high-availability.zookeeper.path.root: /flink
# --- Execution target ---
execution.target: yarn-application
# --- YARN Application settings ---
yarn.application-attempts: 2
yarn.maximum-failed-containers: 3
# --- Kerberos security ---
security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /opt/config/kerberos/keytabs/flink.keytab
security.kerberos.login.principal: flink@BIGDATA.CLUSTER
security.kerberos.login.contexts: Client,KafkaClient
security.kerberos.fetch.delegation-token: true
# --- Hadoop configuration path ---
fs.hdfs.hadoopconf: /opt/module/hadoop/etc/hadoop
# --- ClassLoader (avoids conflicts with Hadoop classes) ---
classloader.check-leaked-classloader: false
classloader.resolve-order: parent-first
# --- Archive of completed jobs ---
historyserver.archive.clean-expired-jobs: true
historyserver.archive.fs.dir: hdfs://bigdata-cluster/flink/completed-jobs
EOF
9.3 Verify Flink
bash
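# Obtain a ticket
kinit -kt /opt/config/kerberos/keytabs/flink.keytab flink
# Run the WordCount example in YARN Application mode
# (if examples/batch is missing in your Flink 2.x distribution, use examples/streaming/WordCount.jar)
flink run-application -t yarn-application \
  /opt/module/flink/examples/batch/WordCount.jar
# Check the application status through yarn:
# yarn application -list
# Flink Web UI:
# http://node1:8081 (in Session mode)
10. Kyuubi Configuration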
10.1 About Kyuubi
Kyuubi 1.11.1 is a unified multi-tenant Spark SQL gateway that exposes a standard JDBC/Thrift interface. It is deployed on node1 and uses Spark SQL on YARN as its backend engine.
Client (JDBC/Beeline) --> Kyuubi Server (node1:10009) --> Spark SQL Engine (YARN Container)
JDK version requirement: Kyuubi 1.11.1 passes JDK 9+ options such as --add-opens, so it must be started with JDK 17. The Spark engine (Spark 3.5.8), however, can run on JDK 8 (determined by the JDK of the YARN NodeManager). Example start command:
bash
export JAVA_HOME=/opt/module/jdk17
kinit -kt /opt/config/kerberos/keytabs/causes.keytab causes@BIGDATA.CLUSTER
nohup $JAVA_HOME/bin/java -server -Xms256m -Xmx1024m \
  -cp "/opt/module/kyuubi/conf:/opt/module/kyuubi/jars/*:/opt/module/hadoop/etc/hadoop" \
  org.apache.kyuubi.server.KyuubiServer \
  > /opt/data/ds/kyuubi.log 2>&1 &
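Clients reach the gateway through a HiveServer2-style JDBC URL that carries the server's Kerberos principal. The sketch below assembles such a URL from its parts; `kyuubi_jdbc_url` is a hypothetical helper, and the values are the ones used in this guide.

```shell
#!/usr/bin/env bash
# Build a Kerberos-enabled JDBC URL: jdbc:hive2://host:port/db;principal=...
kyuubi_jdbc_url() {
  local host="$1" port="$2" db="$3" principal="$4"
  printf 'jdbc:hive2://%s:%s/%s;principal=%s\n' "$host" "$port" "$db" "$principal"
}

kyuubi_jdbc_url node1 10009 default kyubiuser/node1@BIGDATA.CLUSTER
```

The principal in the URL is the server's principal with the concrete hostname, not the identity of the connecting user; the user comes from the local ticket cache.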
10.2 kyuubi-env.sh
bash
cat > /opt/module/kyuubi/conf/kyuubi-env.sh << 'EOF'
#!/usr/bin/env bash
# Kyuubi 1.11.1 must be started with JDK 17 (see 10.1)
export JAVA_HOME=/opt/module/jdk17
export HADOOP_HOME=/opt/module/hadoop
export HADOOP_CONF_DIR=/opt/module/hadoop/etc/hadoop
export SPARK_HOME=/opt/module/spark
export SPARK_CONF_DIR=/opt/module/spark/conf
export HIVE_HOME=/opt/module/hive
export HIVE_CONF_DIR=/opt/module/hive/conf
# Kyuubi heap size: 2 GB
export KYUUBI_JAVA_OPTS="-Xmx2048m"
# Kerberos
export KYUUBI_JAVA_OPTS="$KYUUBI_JAVA_OPTS -Djava.security.krb5.conf=/etc/krb5.conf"
EOF
10.3 kyuubi-defaults.conf
bash
cat > /opt/module/kyuubi/conf/kyuubi-defaults.conf << 'EOF'
# ============================================================
# Kyuubi 1.11.1 defaults
# ============================================================
# --- Frontend Thrift service ---
kyuubi.frontend.bind.host=0.0.0.0
kyuubi.frontend.bind.port=10009
# --- Backend engine type and share level ---
# SPARK_SQL: use the Spark SQL engine
# CONNECTION: one dedicated engine per connection (strongest isolation)
kyuubi.engine.type=SPARK_SQL
kyuubi.engine.share.level=CONNECTION
# --- Kerberos ---
kyuubi.authentication=KERBEROS
kyuubi.authentication.kerberos.principal=kyubiuser/_HOST@BIGDATA.CLUSTER
kyuubi.authentication.kerberos.keytab=/opt/config/kerberos/keytabs/kyubiuser.keytab
kyuubi.kinit.principal=kyubiuser/_HOST@BIGDATA.CLUSTER
kyuubi.kinit.keytab=/opt/config/kerberos/keytabs/kyubiuser.keytab
# --- Spark SQL engine settings ---
kyuubi.engine.spark.main.resource=org.apache.kyuubi.engine.spark.SparkSQLEngine
spark.master=yarn
spark.submit.deployMode=client
spark.yarn.principal=spark@BIGDATA.CLUSTER
spark.yarn.keytab=/opt/config/kerberos/keytabs/spark.keytab
spark.driver.memory=1g
spark.executor.memory=2g
spark.executor.cores=1
spark.executor.instances=2
spark.sql.catalogImplementation=hive
spark.sql.hive.metastore.version=3.1.3
spark.sql.hive.metastore.jars=path
spark.sql.hive.metastore.jars.path=/opt/module/hive/lib/*
spark.yarn.submit.waitAppCompletion=false
# --- Logging ---
kyuubi.logging.dir=/opt/data/kyuubi/logs
# --- Session timeouts ---
kyuubi.session.engine.idle.timeout=PT30M
kyuubi.session.idle.timeout=PT1H
EOF
10.4 Start Kyuubi
bash
# Authenticate
kinit -kt /opt/config/kerberos/keytabs/kyubiuser.keytab kyubiuser
# Create the log directory
mkdir -p /opt/data/kyuubi/logs
# Start the Kyuubi service
nohup /opt/module/kyuubi/bin/kyuubi start > /opt/data/kyuubi/kyuubi-start.log 2>&1 &
# Check the startup log
sleep 5
tail -20 /opt/data/kyuubi/logs/kyuubi-*.log
Verify:
bash
/opt/module/kyuubi/bin/beeline -u "jdbc:hive2://node1:10009/default;principal=kyubiuser/node1@BIGDATA.CLUSTER" -n causes -e "SHOW DATABASES;"
11. DolphinScheduler
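11.1 About DolphinScheduler
DolphinScheduler 3.4.2 is a distributed workflow scheduling system. Component layout:
- Master Server (node1): splits DAGs, schedules tasks, and dispatches commands
- API Server (node1): serves the REST API and the Web UI
- Worker Server (node3/4/5): executes the actual tasks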
11.2 确认安装
bash
ls /opt/module/dolphinscheduler/
# If not yet extracted:
# cd /opt/software/
# tar -xzf apache-dolphinscheduler-3.4.2-bin.tar.gz -C /opt/module/
# mv /opt/module/apache-dolphinscheduler-*-bin /opt/module/dolphinscheduler
11.3 Initialize the Database
bash
# Locate the SQL initialization script (its location varies)
SQL_FILE=$(find /opt/module/dolphinscheduler -name "dolphinscheduler_mysql.sql" 2>/dev/null | head -1)
if [ -n "$SQL_FILE" ]; then
  docker exec -i mysql8 mysql -u root -pRoot@2026! dolphinscheduler < "$SQL_FILE"
  echo "[OK] DolphinScheduler schema initialized"
else
  echo "[WARN] SQL file not found. Please locate and execute manually:"
  echo " docker exec -i mysql8 mysql -u root -pRoot@2026! dolphinscheduler < <sql_file>"
fi
Verify:
bash
docker exec mysql8 mysql -u root -pRoot@2026! -e "USE dolphinscheduler; SHOW TABLES;"
# Expected: the t_ds_* family of tables
11.4 API Server Configuration
bash
cat > /opt/module/dolphinscheduler/api-server/conf/application.yaml << 'EOF'
server:
  port: 12345
spring:
  jackson:
    time-zone: Asia/Shanghai
    date-format: "yyyy-MM-dd HH:mm:ss"
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://node1:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8&useSSL=false&serverTimezone=Asia/Shanghai&allowPublicKeyRetrieval=true
    username: ds
    password: Dolphin@2026!
    hikari:
      connection-test-query: select 1
      minimum-idle: 5
      maximum-pool-size: 50
      connection-timeout: 30000
      idle-timeout: 600000
  quartz:
    properties:
      org.quartz.jobStore.isClustered: true
      org.quartz.jobStore.clusterCheckinInterval: 5000
      org.quartz.scheduler.instanceId: AUTO
registry:
  type: zookeeper
  zookeeper:
    namespace: dolphinscheduler
    connect-string: node3:2181,node4:2181,node5:2181
    session-timeout: 30s
    connection-timeout: 9s
    block-until-connected: 6000ms
management:
  endpoints:
    web:
      exposure:
        include: health,metrics
  endpoint:
    health:
      enabled: true
security:
  authentication:
    type: PASSWORD
EOF
11.5 Master Server Configuration
bash
cat > /opt/module/dolphinscheduler/master-server/conf/application.yaml << 'EOF'
master:
  listen-port: 5678
  fetch-command-num: 10
  pre-exec-threads: 10
  exec-threads: 10
  dispatch-threads: 10
  host-selector: lower_weight
  host-weight:
    node3: 100
    node4: 100
    node5: 100
server:
  port: 5679
spring:
  jackson:
    time-zone: Asia/Shanghai
    date-format: "yyyy-MM-dd HH:mm:ss"
registry:
  type: zookeeper
  zookeeper:
    namespace: dolphinscheduler
    connect-string: node3:2181,node4:2181,node5:2181
    session-timeout: 30s
    connection-timeout: 9s
EOF
11.6 Common Properties (Worker / Kerberos)
bash
cat > /opt/module/dolphinscheduler/worker-server/conf/common.properties << 'EOF'
# ============================================================
# DolphinScheduler common configuration
# ============================================================
# Local data directory
data.basedir.path=/tmp/dolphinscheduler
# Resource storage type (HDFS)
resource.storage.type=HDFS
resource.hdfs.fs.defaultFS=hdfs://bigdata-cluster
resource.hdfs.root.user=causes
# Kerberos
kerberos.startup.enable=true
kerberos.expiration.time=2
kerberos.max.retry=3
kerberos.krb5.conf.path=/etc/krb5.conf
kerberos.java.security.krb5.conf=/etc/krb5.conf
login.user.keytab.username=kyubiuser@BIGDATA.CLUSTER
login.user.keytab.path=/opt/config/kerberos/keytabs/kyubiuser.keytab
# Tenant
tenant.default=hadoop
tenant.auto.create=true
# Task retry
task.commit.retry.count=3
task.commit.retry.interval=1000
# Hadoop security
hadoop.security.authentication.startup.state=true
hadoop.security.authentication=kerberos
java.security.krb5.conf.path=/etc/krb5.conf
# YARN application status URL
yarn.application.status.address=http://node2:8088/ws/v1/cluster/apps/%s
EOF
11.7 Start DolphinScheduler
bash
# Authenticate
kinit -kt /opt/config/kerberos/keytabs/kyubiuser.keytab kyubiuser
cd /opt/module/dolphinscheduler
# node1: start Master + API + Worker
nohup bash bin/dolphinscheduler-daemon.sh start master-server &
nohup bash bin/dolphinscheduler-daemon.sh start api-server &
nohup bash bin/dolphinscheduler-daemon.sh start worker-server &
# node3/4/5: start Worker
for host in node3 node4 node5; do
  ssh $host "source /etc/profile.d/custom_profile.sh && kinit -kt /opt/config/kerberos/keytabs/kyubiuser.keytab kyubiuser && cd /opt/module/dolphinscheduler && nohup bash bin/dolphinscheduler-daemon.sh start worker-server &"
done
11.8 Access the Web UI
- URL: http://node1:12345/dolphinscheduler
- Default username: admin
- Default password: dolphinscheduler123
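12. HDFS Permission Model
12.1 Overview
With Kerberos authentication enabled, HDFS uses a hybrid POSIX + ACL permission model:
- POSIX model: every file or directory has one owner, one group, and "others", each with rwx permissions
- ACL extension: grants fine-grained permissions to additional users and groups, and supports default ACLs that new children inherit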
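The octal modes used in the following sections (750, 770, 755) map directly onto the owner/group/others rwx triplets of the POSIX model. The sketch below decodes a mode into that form; `mode_to_rwx` is a hypothetical helper that looks only at the last three digits, so a leading sticky-bit digit (as in 1777) is ignored.

```shell
#!/usr/bin/env bash
# Convert an octal mode such as 750 into its rwx string (rwxr-x---).
# Only the last three digits are decoded; special bits are ignored.
mode_to_rwx() {
  local mode="${1: -3}" out="" digit i
  local -a bits=(--- --x -w- -wx r-- r-x rw- rwx)
  for (( i = 0; i < ${#mode}; i++ )); do
    digit="${mode:i:1}"
    out+="${bits[$digit]}"
  done
  echo "$out"
}

mode_to_rwx 750
mode_to_rwx 770
```

For example, 750 on /production means the owner has full access, the group can list and read, and everyone else is locked out.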
12.2 Create Linux Users and Groups
bash
# Create the business groups
sudo groupadd bigdata
sudo groupadd datateam
sudo groupadd aiteam
sudo groupadd bizteam
# Add causes to every group
sudo usermod -aG bigdata,datateam,aiteam,bizteam causes
# Create the business users
sudo useradd -g bigdata -m -s /bin/bash dataeng
sudo useradd -g datateam -m -s /bin/bash analyst
sudo useradd -g aiteam -m -s /bin/bash mlops
sudo useradd -g bizteam -m -s /bin/bash bizuser
# Set passwords
for user in dataeng analyst mlops bizuser; do
  echo "$user:${user}@2026" | sudo chpasswd
done
12.3 Create the HDFS Directory Layout
bash
kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs
# ---- Root directories ----
hdfs dfs -mkdir -p /warehouse
hdfs dfs -chmod 1777 /warehouse
# ---- Staging area (temporary storage for ETL data) ----
hdfs dfs -mkdir -p /staging/{incoming,processing,archived}
# -R so the subdirectories are owned by causes too, not by the hdfs user that created them
hdfs dfs -chown -R causes:bigdata /staging
hdfs dfs -chmod 770 /staging
hdfs dfs -chmod 770 /staging/incoming
hdfs dfs -chmod 750 /staging/processing
hdfs dfs -chmod 750 /staging/archived
# ---- Production layers (standard data warehouse layering) ----
# RAW: raw ingestion layer
# ODS: operational data store
# DWD: detail data layer
# DWS: summary data layer
# ADS: application data layer
hdfs dfs -mkdir -p /production/{raw,ods,dwd,dws,ads}
hdfs dfs -chown causes:bigdata /production
hdfs dfs -chmod 750 /production
for layer in raw ods dwd dws ads; do
  hdfs dfs -chown causes:bigdata /production/$layer
  hdfs dfs -chmod 750 /production/$layer
done
# ---- User home directories ----
for user in causes dataeng analyst mlops bizuser; do
  hdfs dfs -mkdir -p /user/$user
  hdfs dfs -chown $user:bigdata /user/$user
  hdfs dfs -chmod 750 /user/$user
done
# ---- Flink system directories ----
hdfs dfs -mkdir -p /flink/{checkpoints,savepoints,ha,completed-jobs}
hdfs dfs -chown -R causes:bigdata /flink
hdfs dfs -chmod -R 755 /flink
# ---- Verify ----
hdfs dfs -ls /
12.4 Fine-Grained Access Control with ACLs
bash
# Give the datateam group read-only access to the raw layer
hdfs dfs -setfacl -m group:datateam:r-x /production/raw
# Give the dataeng user full control of staging/incoming
hdfs dfs -setfacl -m user:dataeng:rwx /staging/incoming
# Set a default ACL (newly created subdirectories inherit it)
hdfs dfs -setfacl -m default:group:datateam:r-x /production/ods
# Show the ACL
hdfs dfs -getfacl /production/raw
Expected ACL output:
# file: /production/raw
# owner: causes
# group: bigdata
user::rwx
group::r-x
group:datateam:r-x
mask::r-x
other::---
12.5 Warehouse Directory ACL (Important)
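Hive's /warehouse directory needs a default ACL right after it is created, so that later writes through Kyuubi, Spark, or Hive do not fail with Permission denied:
bash
# Default ACL: every new subdirectory inherits rwx for user, group, and other
hdfs dfs -setfacl -m default:user::rwx,default:group::rwx,default:other::rwx /warehouse
hdfs dfs -chmod 777 /warehouse
This step is built into the hdfs_start function of bigdata_cluster.sh and runs automatically after HDFS starts.
Why it is needed: the Metastore creates database and table directories as hdfs (hdfs:supergroup), while the Kyuubi Spark engine writes data as causes. Without ACL inheritance, new directories belong to hdfs and causes cannot write to them. With the default ACL in place, new directories automatically grant causes write access.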
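To confirm that the default ACL is really in place, the getfacl output can be scanned for the expected entry. The sketch below does an exact whole-line match on text read from standard input; `has_acl_entry` is a hypothetical helper, and the sample text assumes the same output layout as the getfacl example in 12.4.

```shell
#!/usr/bin/env bash
# Succeeds if stdin contains the given ACL entry as a complete line.
has_acl_entry() {
  local entry="$1"
  grep -Fxq -- "$entry"
}

# Sample input shaped like `hdfs dfs -getfacl /warehouse` output (illustrative)
sample='# file: /warehouse
user::rwx
group::rwx
other::rwx
default:user::rwx
default:group::rwx
default:other::rwx'

if printf '%s\n' "$sample" | has_acl_entry "default:other::rwx"; then
  echo "default ACL present"
fi

# Against a live cluster (not executed here):
# hdfs dfs -getfacl /warehouse | has_acl_entry "default:other::rwx"
```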
12.6 Verify Permission Isolation
bash
# As dataeng, try to write into causes' directory (should be denied)
echo "dataeng@2026" | kinit dataeng@BIGDATA.CLUSTER
hdfs dfs -put /tmp/test.txt /user/causes/ 2>&1 | grep "Permission denied"
# Expected: Permission denied: user=dataeng, access=WRITE, inode="/user/causes"
13. One-Click Start/Stop Script
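13.1 Script Design
/opt/script/bigdata_cluster.sh is the core operations script for the cluster. Design principles:
- Staged start and stop: base services (ZK/HDFS/YARN) are handled separately from upper-layer services, following their dependencies
- Start order: ZK → JN → NN → ZKFC → DN → RM → NM → HS → Hive → Kafka → Kyuubi
- Stop order: the reverse (upper layers first, base services last)
- Status check: one command shows the component status on every node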
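The stop order does not need to be maintained by hand: it can be derived by reversing the start order. The sketch below shows one way to do that with a bash array; `START_ORDER` and `reverse_list` are hypothetical names and are not taken from bigdata_cluster.sh.

```shell
#!/usr/bin/env bash
# Start order from the design principles above
START_ORDER=(ZK JN NN ZKFC DN RM NM HS Hive Kafka Kyuubi)

# Print the arguments in reverse order, one per line.
reverse_list() {
  local -a items=("$@")
  local i
  for (( i = ${#items[@]} - 1; i >= 0; i-- )); do
    printf '%s\n' "${items[i]}"
  done
}

# Derive the stop order and print it on one line
reverse_list "${START_ORDER[@]}" | paste -sd' ' -
```

Keeping a single ordered list means that adding a component in one place updates both the start and the stop sequence.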
13.2 Script Contents
bash
cat > /opt/script/bigdata_cluster.sh << 'BIGSCRIPT'
#!/bin/bash
# ================================================================
# One-click start/stop script for the big data cluster, v2.0
# Usage:
# bigdata_cluster.sh start        start all components
# bigdata_cluster.sh stop         stop all components
# bigdata_cluster.sh status       show the status of all components
# bigdata_cluster.sh start base   start base services only (ZK, HDFS, YARN)
# bigdata_cluster.sh start upper  start upper-layer services only (Hive, Kafka, Kyuubi, DS)
# bigdata_cluster.sh restart      restart everything
# ================================================================
set -e
source /etc/profile.d/custom_profile.sh
# ============================================================
# Node role definitions
# ============================================================
ALL="node1 node2 node3 node4 node5"
ZK_NODES="node3 node4 node5"
JN_NODES="node3 node4 node5"
DN_NODES="node3 node4 node5"
NM_NODES="node3 node4 node5"
KAFKA_NODES="node3 node4 node5"
DS_WORKER_NODES="node3 node4 node5"
# ============================================================
# Kerberos authentication
# ============================================================
init_kerb() {
echo "[INFO] Obtaining HDFS Kerberos ticket..."
kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs
}
# ============================================================
# Start base services
# ============================================================
start_base() {
echo "=========================================="
echo " Starting base services (ZK + HDFS + YARN)"
echo "=========================================="
# 1. ZooKeeper
echo "[1/8] Starting ZooKeeper..."
for host in $ZK_NODES; do
ssh $host "$ZOOKEEPER_HOME/bin/zkServer.sh start"
done
sleep 3
# 2. JournalNode
echo "[2/8] Starting JournalNode..."
for host in $JN_NODES; do
ssh $host "$HADOOP_HOME/bin/hdfs --daemon start journalnode"
done
sleep 2
# 3. NameNode Active (node1)
echo "[3/8] Starting NameNode (node1)..."
ssh node1 "kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs && $HADOOP_HOME/bin/hdfs --daemon start namenode"
sleep 2
# 4. NameNode Standby (node2)
echo "[4/8] Starting NameNode (node2)..."
ssh node2 "kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs && $HADOOP_HOME/bin/hdfs --daemon start namenode"
sleep 2
# 5. ZKFC (node1 + node2)
echo "[5/8] Starting ZKFC..."
ssh node1 "$HADOOP_HOME/bin/hdfs --daemon start zkfc"
ssh node2 "$HADOOP_HOME/bin/hdfs --daemon start zkfc"
sleep 2
# 6. DataNode (node3/4/5)
echo "[6/8] Starting DataNode..."
for host in $DN_NODES; do
ssh $host "kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs && $HADOOP_HOME/bin/hdfs --daemon start datanode"
done
sleep 2
# 7. ResourceManager (node2)
echo "[7/8] 启动 ResourceManager..."
ssh node2 "kinit -kt /opt/config/kerberos/keytabs/yarn.keytab yarn && $HADOOP_HOME/bin/yarn --daemon start resourcemanager"
sleep 2
# 8. NodeManager (node3/4/5)
echo "[8/8] 启动 NodeManager..."
for host in $NM_NODES; do
ssh $host "kinit -kt /opt/config/kerberos/keytabs/yarn.keytab yarn && $HADOOP_HOME/bin/yarn --daemon start nodemanager"
done
sleep 2
echo "[DONE] 基础服务启动完成"
}
# ============================================================
# Start upper-layer services
# ============================================================
start_upper() {
    echo "=========================================="
    echo " Starting upper-layer services (HS + Hive + Kafka + Kyuubi + DS)"
    echo "=========================================="
    # 1. History Server
    echo "[1/5] Starting JobHistoryServer..."
    ssh node1 "kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs && $HADOOP_HOME/bin/mapred --daemon start historyserver"
    sleep 1
    # 2. Hive Metastore
    echo "[2/5] Starting Hive Metastore..."
    ssh node2 "kinit -kt /opt/config/kerberos/keytabs/hive.service.keytab hive/node2@BIGDATA.CLUSTER && nohup $HIVE_HOME/bin/hive --service metastore > /opt/data/hive/metastore.log 2>&1 &"
    sleep 2
    # 3. Kafka
    echo "[3/5] Starting Kafka..."
    for host in $KAFKA_NODES; do
        ssh "$host" "export KAFKA_OPTS='-Djava.security.auth.login.config=/opt/module/kafka/config/kafka_server_jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf' && nohup $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties > /opt/data/kafka/kafka.log 2>&1 &"
    done
    sleep 3
    # 4. Kyuubi
    echo "[4/5] Starting Kyuubi..."
    ssh node1 "kinit -kt /opt/config/kerberos/keytabs/kyubiuser.keytab kyubiuser && nohup $KYUUBI_HOME/bin/kyuubi start > /opt/data/kyuubi/kyuubi-start.log 2>&1 &"
    sleep 2
    # 5. DolphinScheduler (master, API, and worker on node1; workers on node3/4/5)
    echo "[5/5] Starting DolphinScheduler..."
    ssh node1 "kinit -kt /opt/config/kerberos/keytabs/kyubiuser.keytab kyubiuser && cd /opt/module/dolphinscheduler && nohup bash bin/dolphinscheduler-daemon.sh start master-server &"
    ssh node1 "cd /opt/module/dolphinscheduler && nohup bash bin/dolphinscheduler-daemon.sh start api-server &"
    ssh node1 "cd /opt/module/dolphinscheduler && nohup bash bin/dolphinscheduler-daemon.sh start worker-server &"
    for host in $DS_WORKER_NODES; do
        ssh "$host" "kinit -kt /opt/config/kerberos/keytabs/kyubiuser.keytab kyubiuser && cd /opt/module/dolphinscheduler && nohup bash bin/dolphinscheduler-daemon.sh start worker-server &"
    done
    sleep 3
    echo "[DONE] Upper-layer services started"
}
# ============================================================
# Stop upper-layer services (reverse order)
# ============================================================
stop_upper() {
    echo "=========================================="
    echo " Stopping upper-layer services"
    echo "=========================================="
    echo "[1/5] Stopping DolphinScheduler..."
    for host in $ALL; do
        ssh "$host" "cd /opt/module/dolphinscheduler 2>/dev/null && bash bin/dolphinscheduler-daemon.sh stop worker-server" 2>/dev/null || true
    done
    ssh node1 "cd /opt/module/dolphinscheduler 2>/dev/null && bash bin/dolphinscheduler-daemon.sh stop api-server" 2>/dev/null || true
    ssh node1 "cd /opt/module/dolphinscheduler 2>/dev/null && bash bin/dolphinscheduler-daemon.sh stop master-server" 2>/dev/null || true
    echo "[2/5] Stopping Kyuubi..."
    ssh node1 "$KYUUBI_HOME/bin/kyuubi stop" 2>/dev/null || true
    echo "[3/5] Stopping Kafka..."
    for host in $KAFKA_NODES; do
        ssh "$host" "$KAFKA_HOME/bin/kafka-server-stop.sh" 2>/dev/null || true
    done
    sleep 3
    echo "[4/5] Stopping Hive Metastore..."
    ssh node2 "pkill -f 'hive.metastore'" 2>/dev/null || true
    echo "[5/5] Stopping HistoryServer..."
    ssh node1 "$HADOOP_HOME/bin/mapred --daemon stop historyserver" 2>/dev/null || true
    echo "[DONE] Upper-layer services stopped"
}
# ============================================================
# Stop base services (reverse order)
# ============================================================
stop_base() {
    echo "=========================================="
    echo " Stopping base services"
    echo "=========================================="
    echo "[1/8] Stopping NodeManager..."
    for host in $NM_NODES; do
        ssh "$host" "$HADOOP_HOME/bin/yarn --daemon stop nodemanager" 2>/dev/null || true
    done
    echo "[2/8] Stopping ResourceManager..."
    ssh node2 "$HADOOP_HOME/bin/yarn --daemon stop resourcemanager" 2>/dev/null || true
    echo "[3/8] Stopping DataNode..."
    for host in $DN_NODES; do
        ssh "$host" "$HADOOP_HOME/bin/hdfs --daemon stop datanode" 2>/dev/null || true
    done
    echo "[4/8] Stopping ZKFC..."
    ssh node1 "$HADOOP_HOME/bin/hdfs --daemon stop zkfc" 2>/dev/null || true
    ssh node2 "$HADOOP_HOME/bin/hdfs --daemon stop zkfc" 2>/dev/null || true
    echo "[5/8] Stopping NameNode..."
    ssh node1 "$HADOOP_HOME/bin/hdfs --daemon stop namenode" 2>/dev/null || true
    ssh node2 "$HADOOP_HOME/bin/hdfs --daemon stop namenode" 2>/dev/null || true
    echo "[6/8] Stopping JournalNode..."
    for host in $JN_NODES; do
        ssh "$host" "$HADOOP_HOME/bin/hdfs --daemon stop journalnode" 2>/dev/null || true
    done
    echo "[7/8] Stopping ZooKeeper..."
    for host in $ZK_NODES; do
        ssh "$host" "$ZOOKEEPER_HOME/bin/zkServer.sh stop" 2>/dev/null || true
    done
    sleep 2
    echo "[8/8] Destroying the local Kerberos ticket..."
    kdestroy 2>/dev/null || true
    echo "[DONE] Base services stopped"
}
# ============================================================
# Status check
# ============================================================
status_all() {
    echo "=========================================="
    echo " Cluster component status"
    echo "=========================================="
    echo ""
    echo "--- ZooKeeper status ---"
    for host in $ZK_NODES; do
        echo -n " [$host] "
        ssh "$host" "$ZOOKEEPER_HOME/bin/zkServer.sh status 2>&1 | tail -1"
    done
    echo ""
    echo "--- HDFS HA state ---"
    # Obtain a ticket first: haadmin needs Kerberos credentials on a secured cluster
    ssh node1 "kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs 2>/dev/null && $HADOOP_HOME/bin/hdfs haadmin -getAllServiceState 2>&1" || echo " [WARN] Unable to get the HA state"
    echo ""
    echo "--- HDFS cluster status ---"
    # The fallback keeps 'set -e' from aborting the remaining checks if this one fails
    ssh node1 "kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs 2>/dev/null && $HADOOP_HOME/bin/hdfs dfsadmin -report 2>&1 | head -20" || echo " [WARN] Unable to get the HDFS report"
    echo ""
    echo "--- YARN node status ---"
    ssh node2 "$HADOOP_HOME/bin/yarn node -list 2>&1 | head -10" || echo " [WARN] Unable to get YARN node information"
    echo ""
    echo "--- JVM processes ---"
    for host in $ALL; do
        echo " [$host]"
        ssh "$host" "jps 2>/dev/null" | grep -v " Jps$" | sed 's/^/ /' || echo " (no running Java processes)"
    done
    echo ""
    echo "--- Key listening ports ---"
    echo " [node1]"
    ssh node1 "netstat -tlnp 2>/dev/null | grep -E '9871|10009|12345|19890|3306' | awk '{print \" \" \$4, \$7}'" || true
    echo " [node2]"
    ssh node2 "netstat -tlnp 2>/dev/null | grep -E '9871|8020|8088|9083|10000|8032' | awk '{print \" \" \$4, \$7}'" || true
    echo " [node3]"
    ssh node3 "netstat -tlnp 2>/dev/null | grep -E '2181|9092|8485' | awk '{print \" \" \$4, \$7}'" || true
}
# ============================================================
# Main entry point
# ============================================================
case "$1" in
    start)
        case "$2" in
            base) init_kerb; start_base ;;
            upper) init_kerb; start_upper ;;
            *) init_kerb; start_base; start_upper ;;
        esac
        echo ""
        echo "Run '$0 status' to check the cluster status"
        ;;
    stop)
        case "$2" in
            base) stop_base ;;
            upper) stop_upper ;;
            *) stop_upper; stop_base ;;
        esac
        ;;
    status)
        status_all
        ;;
    restart)
        "$0" stop
        sleep 5
        "$0" start
        ;;
    *)
        echo "Usage: $0 {start|stop|restart|status} [base|upper]"
        echo ""
        echo "Commands:"
        echo " start    Start all components (base + upper layer)"
        echo " stop     Stop all components"
        echo " restart  Restart all components"
        echo " status   Show the status of all cluster components"
        echo ""
        echo "Staged operation:"
        echo " start base   Start base services only (ZK -> JN -> NN -> ZKFC -> DN -> RM -> NM)"
        echo " start upper  Start upper-layer services only (HS -> Hive -> Kafka -> Kyuubi -> DS)"
        echo " stop base    Stop base services only"
        echo " stop upper   Stop upper-layer services only"
        exit 1
        ;;
esac
echo ""
echo "=========================================="
echo " Done"
echo "=========================================="
BIGSCRIPT
chmod +x /opt/script/bigdata_cluster.sh
13.3 Script usage examples
bash
# Show the current cluster status
/opt/script/bigdata_cluster.sh status
# Staged start (recommended order)
/opt/script/bigdata_cluster.sh start base # base services first
/opt/script/bigdata_cluster.sh start upper # then upper-layer services
# Start everything in one step
/opt/script/bigdata_cluster.sh start
# Stop everything
/opt/script/bigdata_cluster.sh stop
# Restart
/opt/script/bigdata_cluster.sh restart
14. Verification checklist
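Most rows in the checklist tables of this section pair a command with an output to look for, so many of them can be scripted rather than run by hand. The following is a minimal sketch and not part of the original scripts: `run_check`, `PASS_COUNT`, and `FAIL_COUNT` are hypothetical names, and the comparison is a plain substring match, an assumption that suits rows such as `hostname` or `echo $HADOOP_HOME` but not rows that need a browser.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run one checklist item and report PASS or FAIL.
# Usage: run_check "<label>" "<expected substring>" <command> [args...]
PASS_COUNT=0
FAIL_COUNT=0

run_check() {
    local label="$1" expected="$2"
    shift 2
    local output
    # Capture stdout and stderr; a failing command must not abort the run.
    output="$("$@" 2>&1)" || true
    if [[ "$output" == *"$expected"* ]]; then
        PASS_COUNT=$((PASS_COUNT + 1))
        echo "[PASS] $label"
    else
        FAIL_COUNT=$((FAIL_COUNT + 1))
        echo "[FAIL] $label (expected to see: $expected)"
    fi
}
```

For example, `run_check "SSH to node2" "node2" ssh node2 hostname` would cover the passwordless SSH row. A row whose command contains a pipe can be passed as `bash -c '...'` so that the whole pipeline runs as one command.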
14.1 Base environment verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 1 | Hostname | `hostname` | node1 / node2 / node3 / node4 / node5 |
| 2 | hosts resolution | `ping -c1 node2` | Every node responds to ping |
| 3 | Passwordless SSH | `ssh node2 hostname` | No password prompt; prints node2 |
| 4 | JDK 8 | `java -version 2>&1 \| head -1` | 1.8.0_492 |
| 5 | JDK 17 | `/opt/module/jdk17/bin/java -version 2>&1 \| head -1` | 17.0.19 |
| 6 | Environment variable | `echo $HADOOP_HOME` | /opt/module/hadoop |
| 7 | Environment variable | `echo $JAVA_HOME` | /opt/module/jdk8 |
| 8 | Directory layout | `ls /opt/module/` | jdk8, jdk17, hadoop, hive, zookeeper, kafka, spark, flink, kyuubi, dolphinscheduler |
| 9 | Data directories | `ls /opt/data/` | hadoop, hive, kafka, zookeeper, spark, flink, kyuubi, mysql |
| 10 | Distribution script | `ls /opt/script/xsync.sh` | File exists and is executable (+x) |
14.2 MySQL verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 11 | Docker running | `docker ps \| grep mysql8` | mysql8 container is Up |
| 12 | MySQL connection | `docker exec mysql8 mysql -u root -pRoot@2026! -e "SELECT 1"` | Returns 1 |
| 13 | Hive database | `docker exec mysql8 mysql -u root -pRoot@2026! -e "SHOW DATABASES" \| grep hive` | hive_metastore |
| 14 | DS database | `docker exec mysql8 mysql -u root -pRoot@2026! -e "SHOW DATABASES" \| grep dolphin` | dolphinscheduler |
| 15 | Hive user | `docker exec mysql8 mysql -u hive -pHive@2026! -e "SELECT 1"` | Returns 1 |
| 16 | DS user | `docker exec mysql8 mysql -u ds -pDolphin@2026! -e "SELECT 1"` | Returns 1 |
14.3 Kerberos verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 17 | KDC running | `sudo systemctl status krb5-kdc` | active (running) |
| 18 | Admin service | `sudo systemctl status krb5-admin-server` | active (running) |
| 19 | Principal list | `sudo kadmin.local -q "listprincs" \| wc -l` | More than 20 principals |
| 20 | Keytab files | `ls /opt/config/kerberos/keytabs/*.keytab \| wc -l` | More than 10 keytabs |
| 21 | Password authentication | `echo "causes123" \| kinit causes@BIGDATA.CLUSTER && klist` | Shows a TGT |
| 22 | Keytab authentication | `kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs && klist` | Shows the TGT for hdfs |
| 23 | Ticket destruction | `kdestroy && klist` | No credentials cache found |
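The cluster script runs `kinit` before almost every command, even when a valid ticket is already cached. A small guard can skip that step when it is not needed. This is a sketch only: `ensure_ticket` is a hypothetical helper that is not part of the original scripts, and it relies on `klist -s`, which in MIT Kerberos prints nothing and returns success only when the credentials cache holds unexpired credentials.

```bash
#!/usr/bin/env bash
# Hypothetical helper: obtain a ticket from a keytab only when no valid
# ticket is cached.
# Usage: ensure_ticket <keytab path> <principal>
ensure_ticket() {
    local keytab="$1" principal="$2"
    # klist -s is silent; exit status 0 means unexpired credentials exist.
    if klist -s; then
        echo "[INFO] valid ticket already cached"
        return 0
    fi
    if [[ ! -r "$keytab" ]]; then
        echo "[ERROR] keytab not readable: $keytab" >&2
        return 1
    fi
    kinit -kt "$keytab" "$principal"
}
```

A call such as `ensure_ticket /opt/config/kerberos/keytabs/hdfs.keytab hdfs` would then stand in for the bare `kinit` of row 22. Note that the guard does not check which principal owns the cached ticket, so it fits accounts that only ever use one principal.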
14.4 ZooKeeper verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 24 | ZK process | `ssh node3 "jps \| grep QuorumPeer"` | QuorumPeerMain process is present |
| 25 | ZK mode | `ssh node3 "/opt/module/zookeeper/bin/zkServer.sh status \| grep Mode"` | leader or follower |
| 26 | ZK read/write | create/get/delete in the ZK client | Operations succeed |
| 27 | HA znode | `ls /hadoop-ha` in the ZK client | bigdata-cluster is listed |
14.5 Hadoop HDFS verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 28 | NameNode process | `ssh node1 "jps \| grep NameNode"` | NameNode process is present |
| 29 | DataNode process | `ssh node3 "jps \| grep DataNode"` | DataNode process is present |
| 30 | JN process | `ssh node3 "jps \| grep JournalNode"` | JournalNode process is present |
| 31 | HA state | `hdfs haadmin -getAllServiceState` | One active, one standby |
| 32 | HDFS read/write | `hdfs dfs -put /tmp/test.txt / && hdfs dfs -cat /test.txt && hdfs dfs -rm /test.txt` | Write and read succeed |
| 33 | HDFS report | `hdfs dfsadmin -report \| grep "Live datanodes"` | Live datanodes (3) |
| 34 | Web UI | Open https://node1:9871 in a browser | NameNode information page is shown |
| 35 | Manual failover | `hdfs haadmin -failover nn1 nn2`, then fail back | Failover succeeds |
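Row 31 expects exactly one active and one standby NameNode, which is easy to misread by eye after a failover test. The sketch below automates the check under one assumption: that `hdfs haadmin -getAllServiceState` prints one line per NameNode in the form `host:port state`. The function name `check_ha_states` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `hdfs haadmin -getAllServiceState` output on
# stdin and succeed only if exactly one NameNode is active and one is standby.
check_ha_states() {
    local active=0 standby=0 host state
    # Assumed line format: "<host>:<port>   <state>"
    while read -r host state _ || [[ -n "$host" ]]; do
        case "$state" in
            active)  active=$((active + 1)) ;;
            standby) standby=$((standby + 1)) ;;
        esac
    done
    echo "active=$active standby=$standby"
    [[ "$active" -eq 1 && "$standby" -eq 1 ]]
}
```

Typical use would be `hdfs haadmin -getAllServiceState | check_ha_states`, with a non-zero exit status signalling a split or fully standby pair.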
14.6 YARN verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 36 | RM process | `ssh node2 "jps \| grep ResourceManager"` | ResourceManager process is present |
| 37 | NM process | `ssh node3 "jps \| grep NodeManager"` | NodeManager process is present |
| 38 | YARN nodes | `yarn node -list` | All 3 NodeManagers are RUNNING |
| 39 | Web UI | Open http://node2:8088 in a browser | YARN management UI |
| 40 | Job submission | `yarn jar ... pi 2 2` | Job completes |
| 41 | History Server | `ssh node1 "jps \| grep JobHistoryServer"` | Process is present |
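Row 38 expects three NodeManagers in the RUNNING state. A sketch of counting them, assuming `yarn node -list` prints one row per node with the node state in the second whitespace-separated column; `count_running_nodes` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `yarn node -list` output on stdin and print
# the number of nodes whose state column is RUNNING.
count_running_nodes() {
    # Header and summary lines never have RUNNING as their second field.
    awk '$2 == "RUNNING" { n++ } END { print n + 0 }'
}
```

A check could then compare the result with the expected node count, for example `[ "$(yarn node -list 2>/dev/null | count_running_nodes)" -eq 3 ]`.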
14.7 Hive Metastore verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 42 | Metastore process | `ssh node2 "jps \| grep RunJar"` | RunJar process is present (metastore) |
| 43 | Thrift port | `ssh node2 "netstat -tlnp \| grep 9083"` | Port 9083 in LISTEN state |
| 44 | Beeline connection | `beeline -u "jdbc:hive2://node2:10000..." -e "SHOW DATABASES"` | Shows default |
| 45 | Metastore tables | `docker exec mysql8 mysql ... -e "USE hive_metastore; SELECT COUNT(*) FROM TBLS"` | Returns a number |
14.8 Kafka verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 46 | Kafka process | `ssh node3 "jps \| grep Kafka"` | Kafka process is present |
| 47 | Listening port | `ssh node3 "netstat -tlnp \| grep 9092"` | Port 9092 in LISTEN state |
| 48 | Topic creation | `kafka-topics.sh --create --topic test ...` | Created topic test |
| 49 | Message production | `echo "test" \| kafka-console-producer.sh ...` | No errors |
| 50 | Message consumption | `kafka-console-consumer.sh ... --from-beginning` | The message test is shown |
| 51 | Topic deletion | `kafka-topics.sh --delete --topic test ...` | No errors |
14.9 Spark verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 52 | Spark Pi | `spark-submit --master yarn ... SparkPi 10` | Pi is roughly 3.14... |
14.10 Flink verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 53 | Flink submission | `flink run-application -t yarn-application examples/.../WordCount.jar` | Job completes |
14.11 Kyuubi verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 54 | Kyuubi status | `/opt/module/kyuubi/bin/kyuubi status` | Kyuubi Server is running |
| 55 | JDBC connection | `beeline -u "jdbc:hive2://node1:10009/..." -e "SELECT 1"` | Returns 1 |
14.12 DolphinScheduler verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 56 | DS Master | `ssh node1 "jps \| grep MasterServer"` | MasterServer process is present |
| 57 | DS API | `ssh node1 "jps \| grep ApiApplicationServer"` | ApiApplicationServer process is present |
| 58 | DS Worker | `ssh node3 "jps \| grep WorkerServer"` | WorkerServer process is present |
| 59 | Web UI | Open http://node1:12345/dolphinscheduler in a browser | Login page |
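14.13 HDFS permission verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 60 | Directory layout | `hdfs dfs -ls /` | warehouse, staging, production, user, system, flink are listed |
| 61 | ACL configuration | `hdfs dfs -getfacl /production/raw` | Shows group:datateam:r-x |
| 62 | Permission isolation | User dataeng writes to the directory owned by causes | Permission denied |
14.14 One-click script verification
| No. | Check | Verification command | Expected result |
|---|---|---|---|
| 63 | Script exists | `ls /opt/script/bigdata_cluster.sh` | File exists and is executable |
| 64 | Status check | `/opt/script/bigdata_cluster.sh status` | Shows the status of all components |
| 65 | Staged start | `/opt/script/bigdata_cluster.sh start base` | Base services start in order |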
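Appendix A: Common command quick reference
A.1 Kerberos
bash
# Obtain a ticket
kinit -kt /opt/config/kerberos/keytabs/hdfs.keytab hdfs
# List cached tickets
klist
# Destroy tickets
kdestroy
# List the principals in a keytab
klist -kt /opt/config/kerberos/keytabs/hdfs.keytab
# Create a new principal
sudo kadmin.local -q "addprinc -pw <password> <principal>"
# Generate a keytab (ktadd re-randomizes the key, so the old password stops working;
# kadmin.local accepts -norandkey to keep the existing key)
sudo kadmin.local -q "ktadd -k <keytab_path> <principal>"
A.2 HDFS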
bash
# File system operations
hdfs dfs -ls /
hdfs dfs -mkdir /newdir
hdfs dfs -put localfile /target/
hdfs dfs -cat /path/to/file
hdfs dfs -rm -r /path/to/dir
# Administration
hdfs dfsadmin -report # Cluster status report
hdfs dfsadmin -safemode get # Safe mode status
hdfs haadmin -getAllServiceState # HA state
hdfs haadmin -failover nn1 nn2 # Manual failover
hdfs dfsadmin -refreshNodes # Refresh the node list
hdfs dfs -setfacl -m user:name:rwx /path # Set an ACL
hdfs dfs -getfacl /path # Show ACLs
# File system check
hdfs fsck / -files -blocks -locations # File system health check
A.3 YARN
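bash
# Application management
yarn application -list # List applications (SUBMITTED, ACCEPTED, and RUNNING by default; add -appStates ALL for all)
yarn application -kill <app_id> # Kill an application
yarn application -status <app_id> # Application status
yarn logs -applicationId <app_id> # Show application logs
# Node management
yarn node -list # List NodeManager nodes
yarn node -status <node_id> # Node details
# Queue management
yarn queue -status <queue_name> # Queue status
A.4 ZooKeeper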
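bash
# Client connection (the commands below run inside the zkCli shell)
zkCli.sh -server node3:2181
ls / # List the root znodes
get /path # Read a znode
create /path "data" # Create a znode
delete /path # Delete a znode
quit # Exit
# Service management
zkServer.sh status # Show the role (leader or follower)
zkServer.sh stop # Stop the server
A.5 Kafka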
bash
# Topic management
kafka-topics.sh --list --bootstrap-server node3:9092 --command-config client.properties
kafka-topics.sh --create --topic test --partitions 3 --replication-factor 2 ...
kafka-topics.sh --describe --topic test ...
kafka-topics.sh --delete --topic test ...
# Consume / produce
kafka-console-consumer.sh --bootstrap-server node3:9092 --topic test --from-beginning --consumer.config client.properties
kafka-console-producer.sh --bootstrap-server node3:9092 --topic test --producer.config client.properties
# Consumer groups
kafka-consumer-groups.sh --bootstrap-server node3:9092 --list --command-config client.properties
Appendix B: Full start and stop order
The stop order is the exact reverse of the start order. Each row lists the service started and the service stopped at that step.
| Step | Start order | Stop order (reverse) |
|---|---|---|
| 1 | KDC (node1) | DolphinScheduler (node1/3/4/5) |
| 2 | MySQL (node1) | Kyuubi (node1) |
| 3 | ZooKeeper (node3/4/5) | Kafka (node3/4/5) |
| 4 | JournalNode (node3/4/5) | Hive Metastore (node2) |
| 5 | NameNode (node1/2) | JobHistoryServer (node1) |
| 6 | ZKFC (node1/2) | NodeManager (node3/4/5) |
| 7 | DataNode (node3/4/5) | ResourceManager (node2) |
| 8 | ResourceManager (node2) | DataNode (node3/4/5) |
| 9 | NodeManager (node3/4/5) | ZKFC (node1/2) |
| 10 | JobHistoryServer (node1) | NameNode (node1/2) |
| 11 | Hive Metastore (node2) | JournalNode (node3/4/5) |
| 12 | Kafka (node3/4/5) | ZooKeeper (node3/4/5) |
| 13 | Kyuubi (node1) | MySQL (node1) |
| 14 | DolphinScheduler (node1/3/4/5) | KDC (node1) |
Appendix C: Port list
| Port | Component | Nodes | Protocol | Description |
|---|---|---|---|---|
| 88 | KDC | node1 | TCP/UDP | Kerberos authentication |
| 749 | KDC Admin | node1 | TCP | Kerberos administration |
| 2181 | ZooKeeper | node3/4/5 | TCP | ZK client connections |
| 2888 | ZooKeeper | node3/4/5 | TCP | ZK follower-to-leader communication |
| 3888 | ZooKeeper | node3/4/5 | TCP | ZK leader election |
| 8020 | NameNode RPC | node1/2 | TCP | HDFS client RPC |
| 9870 | NameNode HTTP | node1/2 | TCP | HDFS Web UI (HTTP) |
| 9871 | NameNode HTTPS | node1/2 | TCP | HDFS Web UI (HTTPS) |
| 8485 | JournalNode | node3/4/5 | TCP | QJM shared EditLog |
| 8030 | RM Scheduler | node2 | TCP | YARN scheduler |
| 8031 | RM Tracker | node2 | TCP | NM registration and heartbeats |
| 8032 | RM App | node2 | TCP | YARN application submission |
| 8088 | RM Web | node2 | TCP | YARN Web UI |
| 8090 | RM Web HTTPS | node2 | TCP | YARN Web UI (HTTPS) |
| 10020 | JobHistory | node1 | TCP | MR History Server |
| 19888 | JH Web | node1 | TCP | MR History Web UI |
| 19890 | JH Web HTTPS | node1 | TCP | MR History Web UI (HTTPS) |
| 3306 | MySQL | node1 | TCP | MySQL database |
| 9083 | Hive Metastore | node2 | TCP | Metastore Thrift |
| 10000 | HiveServer2 | node2 | TCP | Hive JDBC/Thrift |
| 9092 | Kafka | node3/4/5 | TCP | Kafka broker |
| 9093 | Kafka Controller | node3/4/5 | TCP | KRaft controller |
| 10009 | Kyuubi | node1 | TCP | Kyuubi Thrift/JDBC |
| 12345 | DS API | node1 | TCP | DolphinScheduler API/Web |
| 5678 | DS Master | node1 | TCP | DS Master |
| 8081 | Flink WebUI | - | TCP | Flink Web UI |
| 18080 | Spark History | - | TCP | Spark History Server |
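The port table can double as a health check: given a listing of listening sockets, report which expected ports are absent. The sketch below reads `netstat -tln` (or `ss -tln`) output on stdin. `missing_ports` is a hypothetical helper, and matching on `:<port>` followed by whitespace is an assumption about the listing format.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the expected ports that do not appear as
# listening sockets in the listing read from stdin.
# Usage: netstat -tln | missing_ports <port> [<port>...]
missing_ports() {
    local listing port missing=()
    listing="$(cat)"    # read the whole socket listing once
    for port in "$@"; do
        # A local address ends in ":<port>" followed by whitespace.
        if ! grep -Eq ":${port}[[:space:]]" <<<"$listing"; then
            missing+=("$port")
        fi
    done
    echo "${missing[*]}"
    # Succeed only when every expected port was found.
    [[ ${#missing[@]} -eq 0 ]]
}
```

For node3, for example, `ssh node3 "netstat -tln" | missing_ports 2181 9092 8485` would print nothing and succeed when ZooKeeper, Kafka, and the JournalNode are all listening.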
Appendix D: Password summary
| Purpose | Username | Password | Description |
|---|---|---|---|
| MySQL root | root | Root@2026! | Database superuser |
| MySQL hive | hive | Hive@2026! | Hive Metastore user |
| MySQL DS | ds | Dolphin@2026! | DolphinScheduler user |
| KDC database | - | Causes@2026 | Kerberos database master key |
| Kerberos administrator | causes/admin | Causes@2026 | KDC administration |
| Kerberos users | hdfs/yarn/hive/spark/flink/kafka/kyubiuser | causes123 | Component users |
| Kerberos user | causes | causes123 | Regular user |
| SSL KeyStore | - | hadoop123 | SSL keystore password |
| DS Web | admin | dolphinscheduler123 | DolphinScheduler Web UI |
Document version: v1.0
Last updated: 2026-06-18
Target cluster: 5-node Ubuntu Server 24.04 | Hadoop 3.4.0 HA
Maintainer: causes