Linux status monitoring - Ward, ServerStatus
Steps/contents:
1. Background
2. Ward
3. ServerStatus
(1) Server side
(2) Client side
(3) Using SS
This post was first published on my personal blog at https://lisper517.top/index.php/archives/202/ ; please credit the source when reposting.
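This post introduces some tools for monitoring the status of Linux machines, similar to the Task Manager on Windows.
It was written on December 30, 2023, and was mainly inspired by the uploader (UP主) 我不是咕咕鸽. He has shared a lot of other interesting content as well; his explanations are detailed and start from the basics, so they are well suited to beginners.
1. Background
In day-to-day server administration, a few tools can greatly reduce the workload. A few days ago I came across many interesting posts on 我不是咕咕鸽's channel and decided to try setting some of them up myself.
Windows has the Task Manager, which makes it easy to check CPU and memory usage, network speed, disk status and so on. Linux has similar tools.
2. Ward
I previously wrote about Pi Dashboard, a project that monitors a Raspberry Pi's status from a web page. Ward is quite similar to Pi Dashboard and can also monitor a Linux machine's status through a web page.
Following the instructions in github-ward, run the following on the Linux machine: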
mkdir -p /docker/ward
cd /docker/ward
vim docker-compose.yml
with the following content:
version: "3.9"
services:
  ward:
    image: antonyleons/ward
    ports:
      - "4000:4000"
    environment:
      - WARD_PORT=4000
    privileged: true
    restart: unless-stopped
Open port 4000 and start Ward:
ufw allow 4000 comment "ward"
docker-compose config
docker-compose up
Then visit ip:4000 in a browser. It is that simple.
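3. ServerStatus
Ward is suited to monitoring a single Linux machine. If you have several Linux machines, you can use ServerStatus (abbreviated as SS below).
Set up one machine as the SS server and run the client code on each SS client. The clients then submit their status data to the SS server, and visiting the server's web page shows the status of all client machines.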
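To make the data flow concrete, here is a minimal sketch of how one status report could be framed. It is modeled on the client script reproduced later in this post, which sends the word update followed by a JSON object and a newline. The helper names build_update_message and parse_update_message and the sample values are illustrative assumptions, not part of ServerStatus.

```python
import json


def build_update_message(stats):
    """Frame a stats dict as one 'update <json>' line, encoded as UTF-8 bytes."""
    return ("update " + json.dumps(stats) + "\n").encode("utf-8")


def parse_update_message(raw):
    """Illustrative inverse: recover the stats dict from one framed line."""
    text = raw.decode("utf-8").rstrip("\n")
    prefix = "update "
    if not text.startswith(prefix):
        raise ValueError("not an update message")
    return json.loads(text[len(prefix):])


# Illustrative values only; the real client fills in many more fields.
sample = {"uptime": 3600, "cpu": 12.5, "memory_total": 2048000, "memory_used": 512000}
message = build_update_message(sample)
```

One line per report keeps the framing simple: the receiving side can split the TCP stream on newlines and decode each line on its own.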
(1) Server side
mkdir -p /docker/server_status/month_traffic
mkdir -p /docker/server_status/conf/server_status
cd /docker/server_status
wget --no-check-certificate -qO /docker/server_status/conf/server_status/config.json https://raw.githubusercontent.com/cppla/ServerStatus/master/server/config.json
vim docker-compose.yml
with the following content:
version: "3.9"
services:
  server_status:
    image: cppla/serverstatus:latest
    container_name: server_status
    ports:
      - "80:80"
      - "35601:35601"
    volumes:
      - /docker/server_status/conf/server_status/config.json:/ServerStatus/server/config.json
      - /docker/server_status/month_traffic:/usr/share/nginx/html/json
    restart: always
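Port 80 serves the web page and port 35601 receives client data, so remember to open both in the firewall (if you use NPM, i.e. Nginx Proxy Manager, comment out the 80 mapping and configure a reverse proxy instead). Finally, run: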
docker-compose config
docker-compose up
The page is now reachable on port 80. config.json ships with a few sample servers, all of which currently show as offline.
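If the page or the clients cannot reach the server, it can help to confirm that both ports accept TCP connections. The following is a minimal sketch using only the Python standard library; the function names is_port_open and check_ports are illustrative, and the address you pass in is whatever your SS server uses.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host, ports=(80, 35601)):
    """Map each port to whether it accepts TCP connections on host."""
    return {port: is_port_open(host, port) for port in ports}
```

Calling check_ports with the server's address from a client machine shows whether both ports are reachable from there; a False for 35601 would be one possible reason for clients that never appear online.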
(2) Client side
The client does not need a public IP, so a server on a home network works too. First confirm that Python is installed on the client:
python3 -V
Cloud servers generally have it preinstalled; if yours does not, see runoob.
Setting up the SS client takes only a few commands:
mkdir -p /docker/SS_client
wget --no-check-certificate -qO /docker/SS_client/client-linux.py 'https://raw.githubusercontent.com/cppla/ServerStatus/master/clients/client-linux.py'
chmod +x /docker/SS_client/client-linux.py
The content of this py script is:
#!/usr/bin/env python3
# coding: utf-8
# Update by : https://github.com/cppla/ServerStatus, Update date: 20220530
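# Version: 1.0.3, supported Python versions: 2.7 to 3.10
# Supported operating systems: Linux, OSX, FreeBSD, OpenBSD and NetBSD, both 32-bit and 64-bit architectures
# Note: by default you only need to change SERVER and USER. The packet-loss probe targets can be customized, e.g. CU = "www.facebook.com".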
SERVER = "127.0.0.1"
USER = "s01"
PASSWORD = "USER_DEFAULT_PASSWORD"
PORT = 35601
CU = "cu.tz.cloudcpp.com"
CT = "ct.tz.cloudcpp.com"
CM = "cm.tz.cloudcpp.com"
PROBEPORT = 80
PROBE_PROTOCOL_PREFER = "ipv4" # ipv4, ipv6
PING_PACKET_HISTORY_LEN = 100
INTERVAL = 1
import socket
import time
import timeit
import re
import os
import sys
import json
import errno
import subprocess
import threading
try:
    from queue import Queue     # python3
except ImportError:
    from Queue import Queue     # python2
def get_uptime():
    with open('/proc/uptime', 'r') as f:
        uptime = f.readline().split('.', 2)
        return int(uptime[0])
def get_memory():
    re_parser = re.compile(r'^(?P<key>\S*):\s*(?P<value>\d*)\s*kB')
    result = dict()
    for line in open('/proc/meminfo'):
        match = re_parser.match(line)
        if not match:
            continue
        key, value = match.groups(['key', 'value'])
        result[key] = int(value)
    MemTotal = float(result['MemTotal'])
    MemUsed = MemTotal-float(result['MemFree'])-float(result['Buffers'])-float(result['Cached'])-float(result['SReclaimable'])
    SwapTotal = float(result['SwapTotal'])
    SwapFree = float(result['SwapFree'])
    return int(MemTotal), int(MemUsed), int(SwapTotal), int(SwapFree)
def get_hdd():
    p = subprocess.check_output(['df', '-Tlm', '--total', '-t', 'ext4', '-t', 'ext3', '-t', 'ext2', '-t', 'reiserfs', '-t', 'jfs', '-t', 'ntfs', '-t', 'fat32', '-t', 'btrfs', '-t', 'fuseblk', '-t', 'zfs', '-t', 'simfs', '-t', 'xfs']).decode("Utf-8")
    total = p.splitlines()[-1]
    used = total.split()[3]
    size = total.split()[2]
    return int(size), int(used)
def get_time():
    with open("/proc/stat", "r") as f:
        time_list = f.readline().split(' ')[2:6]
        for i in range(len(time_list)):
            time_list[i] = int(time_list[i])
        return time_list
def delta_time():
    x = get_time()
    time.sleep(INTERVAL)
    y = get_time()
    for i in range(len(x)):
        y[i]-=x[i]
    return y
def get_cpu():
    t = delta_time()
    st = sum(t)
    if st == 0:
        st = 1
    result = 100-(t[len(t)-1]*100.00/st)
    return round(result, 1)
def liuliang():
    NET_IN = 0
    NET_OUT = 0
    with open('/proc/net/dev') as f:
        for line in f.readlines():
            netinfo = re.findall(r'([^\s]+):[\s]{0,}(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)', line)
            if netinfo:
                if netinfo[0][0] == 'lo' or 'tun' in netinfo[0][0] \
                        or 'docker' in netinfo[0][0] or 'veth' in netinfo[0][0] \
                        or 'br-' in netinfo[0][0] or 'vmbr' in netinfo[0][0] \
                        or 'vnet' in netinfo[0][0] or 'kube' in netinfo[0][0] \
                        or netinfo[0][1]=='0' or netinfo[0][9]=='0':
                    continue
                else:
                    NET_IN += int(netinfo[0][1])
                    NET_OUT += int(netinfo[0][9])
    return NET_IN, NET_OUT
def tupd():
    '''
    tcp, udp, process, thread count: for view ddcc attack , then send warning
    :return:
    '''
    s = subprocess.check_output("ss -t|wc -l", shell=True)
    t = int(s[:-1])-1
    s = subprocess.check_output("ss -u|wc -l", shell=True)
    u = int(s[:-1])-1
    s = subprocess.check_output("ps -ef|wc -l", shell=True)
    p = int(s[:-1])-2
    s = subprocess.check_output("ps -eLf|wc -l", shell=True)
    d = int(s[:-1])-2
    return t,u,p,d
def get_network(ip_version):
    if(ip_version == 4):
        HOST = "ipv4.google.com"
    elif(ip_version == 6):
        HOST = "ipv6.google.com"
    try:
        socket.create_connection((HOST, 80), 2).close()
        return True
    except:
        return False
lostRate = {
    '10010': 0.0,
    '189': 0.0,
    '10086': 0.0
}
pingTime = {
    '10010': 0,
    '189': 0,
    '10086': 0
}
netSpeed = {
    'netrx': 0.0,
    'nettx': 0.0,
    'clock': 0.0,
    'diff': 0.0,
    'avgrx': 0,
    'avgtx': 0
}
diskIO = {
    'read': 0,
    'write': 0
}
def _ping_thread(host, mark, port):
    lostPacket = 0
    packet_queue = Queue(maxsize=PING_PACKET_HISTORY_LEN)
    while True:
        # flush dns , every time.
        IP = host
        if host.count(':') < 1:     # if not plain ipv6 address, means ipv4 address or hostname
            try:
                if PROBE_PROTOCOL_PREFER == 'ipv4':
                    IP = socket.getaddrinfo(host, None, socket.AF_INET)[0][4][0]
                else:
                    IP = socket.getaddrinfo(host, None, socket.AF_INET6)[0][4][0]
            except Exception:
                pass
        if packet_queue.full():
            if packet_queue.get() == 0:
                lostPacket -= 1
        try:
            b = timeit.default_timer()
            socket.create_connection((IP, port), timeout=1).close()
            pingTime[mark] = int((timeit.default_timer() - b) * 1000)
            packet_queue.put(1)
        except socket.error as error:
            if error.errno == errno.ECONNREFUSED:
                pingTime[mark] = int((timeit.default_timer() - b) * 1000)
                packet_queue.put(1)
            #elif error.errno == errno.ETIMEDOUT:
            else:
                lostPacket += 1
                packet_queue.put(0)
        if packet_queue.qsize() > 30:
            lostRate[mark] = float(lostPacket) / packet_queue.qsize()
        time.sleep(INTERVAL)
def _net_speed():
    while True:
        with open("/proc/net/dev", "r") as f:
            net_dev = f.readlines()
            avgrx = 0
            avgtx = 0
            for dev in net_dev[2:]:
                dev = dev.split(':')
                if "lo" in dev[0] or "tun" in dev[0] \
                        or "docker" in dev[0] or "veth" in dev[0] \
                        or "br-" in dev[0] or "vmbr" in dev[0] \
                        or "vnet" in dev[0] or "kube" in dev[0]:
                    continue
                dev = dev[1].split()
                avgrx += int(dev[0])
                avgtx += int(dev[8])
            now_clock = time.time()
            netSpeed["diff"] = now_clock - netSpeed["clock"]
            netSpeed["clock"] = now_clock
            netSpeed["netrx"] = int((avgrx - netSpeed["avgrx"]) / netSpeed["diff"])
            netSpeed["nettx"] = int((avgtx - netSpeed["avgtx"]) / netSpeed["diff"])
            netSpeed["avgrx"] = avgrx
            netSpeed["avgtx"] = avgtx
        time.sleep(INTERVAL)
def _disk_io():
    '''
    good luck for opensource! by: cpp.la
    Disk IO: because IOPS differ between SSDs, HDDs, RAID cards and array technologies such as ZFS, the performance impact of IO has to be judged against your own server's situation.
    For example, on a mechanical hard disk doing many random small-file reads and writes, even very low throughput can keep the disk waiting for a long time.
    With sequential IO, an ordinary mechanical disk writing at around 100Mb/s can also keep the disk waiting for a long time.
    Disk read/write figures have some error: 4k, 8k, see https://stackoverflow.com/questions/34413926/psutil-vs-dd-monitoring-disk-i-o
    :return:
    '''
    while True:
        # pre pid snapshot
        snapshot_first = {}
        # next pid snapshot
        snapshot_second = {}
        # read count snapshot
        snapshot_read = 0
        # write count snapshot
        snapshot_write = 0
        # process snapshot
        pid_snapshot = [str(i) for i in os.listdir("/proc") if i.isdigit() is True]
        for pid in pid_snapshot:
            try:
                with open("/proc/{}/io".format(pid)) as f:
                    pid_io = {}
                    for line in f.readlines():
                        if "read_bytes" in line:
                            pid_io["read"] = int(line.split("read_bytes:")[-1].strip())
                        elif "write_bytes" in line and "cancelled_write_bytes" not in line:
                            pid_io["write"] = int(line.split("write_bytes:")[-1].strip())
                    pid_io["name"] = open("/proc/{}/comm".format(pid), "r").read().strip()
                    snapshot_first[pid] = pid_io
            except:
                if pid in snapshot_first:
                    snapshot_first.pop(pid)
        time.sleep(INTERVAL)
        for pid in pid_snapshot:
            try:
                with open("/proc/{}/io".format(pid)) as f:
                    pid_io = {}
                    for line in f.readlines():
                        if "read_bytes" in line:
                            pid_io["read"] = int(line.split("read_bytes:")[-1].strip())
                        elif "write_bytes" in line and "cancelled_write_bytes" not in line:
                            pid_io["write"] = int(line.split("write_bytes:")[-1].strip())
                    pid_io["name"] = open("/proc/{}/comm".format(pid), "r").read().strip()
                    snapshot_second[pid] = pid_io
            except:
                if pid in snapshot_first:
                    snapshot_first.pop(pid)
                if pid in snapshot_second:
                    snapshot_second.pop(pid)
        for k, v in snapshot_first.items():
            if snapshot_first[k]["name"] == snapshot_second[k]["name"] and snapshot_first[k]["name"] != "bash":
                snapshot_read += (snapshot_second[k]["read"] - snapshot_first[k]["read"])
                snapshot_write += (snapshot_second[k]["write"] - snapshot_first[k]["write"])
        diskIO["read"] = snapshot_read
        diskIO["write"] = snapshot_write
def get_realtime_data():
    '''
    real time get system data
    :return:
    '''
    t1 = threading.Thread(
        target=_ping_thread,
        kwargs={
            'host': CU,
            'mark': '10010',
            'port': PROBEPORT
        }
    )
    t2 = threading.Thread(
        target=_ping_thread,
        kwargs={
            'host': CT,
            'mark': '189',
            'port': PROBEPORT
        }
    )
    t3 = threading.Thread(
        target=_ping_thread,
        kwargs={
            'host': CM,
            'mark': '10086',
            'port': PROBEPORT
        }
    )
    t4 = threading.Thread(
        target=_net_speed,
    )
    t5 = threading.Thread(
        target=_disk_io,
    )
    for ti in [t1, t2, t3, t4, t5]:
        ti.daemon = True
        ti.start()
def byte_str(object):
    '''
    bytes to str, str to bytes
    :param object:
    :return:
    '''
    if isinstance(object, str):
        return object.encode(encoding="utf-8")
    elif isinstance(object, bytes):
        return bytes.decode(object)
    else:
        print(type(object))
if __name__ == '__main__':
    for argc in sys.argv:
        if 'SERVER' in argc:
            SERVER = argc.split('SERVER=')[-1]
        elif 'PORT' in argc:
            PORT = int(argc.split('PORT=')[-1])
        elif 'USER' in argc:
            USER = argc.split('USER=')[-1]
        elif 'PASSWORD' in argc:
            PASSWORD = argc.split('PASSWORD=')[-1]
        elif 'INTERVAL' in argc:
            INTERVAL = int(argc.split('INTERVAL=')[-1])
    socket.setdefaulttimeout(30)
    get_realtime_data()
    while True:
        try:
            print("Connecting...")
            s = socket.create_connection((SERVER, PORT))
            data = byte_str(s.recv(1024))
            if data.find("Authentication required") > -1:
                s.send(byte_str(USER + ':' + PASSWORD + '\n'))
                data = byte_str(s.recv(1024))
                if data.find("Authentication successful") < 0:
                    print(data)
                    raise socket.error
            else:
                print(data)
                raise socket.error
            print(data)
            if data.find("You are connecting via") < 0:
                data = byte_str(s.recv(1024))
                print(data)
            timer = 0
            check_ip = 0
            if data.find("IPv4") > -1:
                check_ip = 6
            elif data.find("IPv6") > -1:
                check_ip = 4
            else:
                print(data)
                raise socket.error
            while True:
                CPU = get_cpu()
                NET_IN, NET_OUT = liuliang()
                Uptime = get_uptime()
                Load_1, Load_5, Load_15 = os.getloadavg()
                MemoryTotal, MemoryUsed, SwapTotal, SwapFree = get_memory()
                HDDTotal, HDDUsed = get_hdd()
                array = {}
                if not timer:
                    array['online' + str(check_ip)] = get_network(check_ip)
                    timer = 10
                else:
                    timer -= 1*INTERVAL
                array['uptime'] = Uptime
                array['load_1'] = Load_1
                array['load_5'] = Load_5
                array['load_15'] = Load_15
                array['memory_total'] = MemoryTotal
                array['memory_used'] = MemoryUsed
                array['swap_total'] = SwapTotal
                array['swap_used'] = SwapTotal - SwapFree
                array['hdd_total'] = HDDTotal
                array['hdd_used'] = HDDUsed
                array['cpu'] = CPU
                array['network_rx'] = netSpeed.get("netrx")
                array['network_tx'] = netSpeed.get("nettx")
                array['network_in'] = NET_IN
                array['network_out'] = NET_OUT
                # todo: kept for compatibility with old versions; ip_status will be removed in the next version
                array['ip_status'] = True
                array['ping_10010'] = lostRate.get('10010') * 100
                array['ping_189'] = lostRate.get('189') * 100
                array['ping_10086'] = lostRate.get('10086') * 100
                array['time_10010'] = pingTime.get('10010')
                array['time_189'] = pingTime.get('189')
                array['time_10086'] = pingTime.get('10086')
                array['tcp'], array['udp'], array['process'], array['thread'] = tupd()
                array['io_read'] = diskIO.get("read")
                array['io_write'] = diskIO.get("write")
                s.send(byte_str("update " + json.dumps(array) + "\n"))
        except KeyboardInterrupt:
            raise
        except socket.error:
            print("Disconnected...")
            if 's' in locals().keys():
                del s
            time.sleep(3)
        except Exception as e:
            print("Caught Exception:", e)
            if 's' in locals().keys():
                del s
            time.sleep(3)
Finally, run the SS client script in the background (although it is better to hold off for a moment, see below):
nohup python3 /docker/SS_client/client-linux.py SERVER={$SERVER} USER={$USER} PASSWORD={$PASSWORD} >/dev/null 2>&1 &
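Here {$SERVER} is the public IP of the SS server machine (the address the client connects to); {$USER} and {$PASSWORD} must match an entry in the server's config.json, which is described below.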
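The KEY=VALUE arguments in the command above override the defaults defined at the top of the script. The following is a simplified sketch of that idea, not the script's own parsing code (the script matches substrings in each argument instead). The function name parse_overrides and the example address 203.0.113.10 are illustrative assumptions.

```python
# Defaults copied from the top of client-linux.py as shown above.
DEFAULTS = {
    "SERVER": "127.0.0.1",
    "PORT": 35601,
    "USER": "s01",
    "PASSWORD": "USER_DEFAULT_PASSWORD",
    "INTERVAL": 1,
}


def parse_overrides(argv, defaults=DEFAULTS):
    """Return a copy of defaults updated from KEY=VALUE items in argv."""
    settings = dict(defaults)
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep or key not in settings:
            continue  # skip the script path and unknown arguments
        # Keep integer settings as integers, everything else as text.
        settings[key] = int(value) if isinstance(defaults[key], int) else value
    return settings


settings = parse_overrides(
    ["client-linux.py", "SERVER=203.0.113.10", "USER=s02", "PASSWORD=secret", "INTERVAL=2"]
)
```

Matching on the exact key before the equals sign avoids one argument being mistaken for another when a value happens to contain a key name.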
Rather than running it by hand, it is better to have the py script start at boot; see my earlier post:
vim /etc/systemd/system/ss_client.service
Write the following into it:
[Unit]
Description=ServerStatus-client
Wants=network-online.target
After=network.target network-online.target
Requires=network-online.target
[Service]
ExecStart=/usr/bin/python3 /docker/SS_client/client-linux.py
ExecStop=/bin/kill $MAINPID
Restart=on-failure
RestartSec=5
StartLimitInterval=0
[Install]
WantedBy=multi-user.target
Before starting the service, edit /docker/SS_client/client-linux.py and fill in your own {$SERVER}, {$USER} and {$PASSWORD}.
Finally, enable and start the service:
systemctl enable ss_client.service
systemctl start ss_client.service
systemctl status ss_client.service
(3) Using SS
The initial content of /docker/server_status/conf/server_status/config.json is:
{
"servers": [
{
"username": "s01",
"name": "node1",
"type": "xen",
"host": "host1",
"location": "🇨🇳",
"password": "USER_DEFAULT_PASSWORD",
"monthstart": 1
},
{
"username": "s02",
"name": "node2",
"type": "vmware",
"host": "host2",
"location": "🇯🇵",
"password": "USER_DEFAULT_PASSWORD",
"monthstart": 1
},
{
"disabled": true,
"username": "s03",
"name": "node3",
"type": "hyper",
"host": "host3",
"location": "🇫🇷",
"password": "USER_DEFAULT_PASSWORD",
"monthstart": 1
},
{
"username": "s04",
"name": "node4",
"type": "kvm",
"host": "host4",
"location": "🇰🇷",
"password": "USER_DEFAULT_PASSWORD",
"monthstart": 1
}
],
"watchdog": [
{
"name": "cpu high warning,exclude username s01",
"rule": "cpu>90&load_1>5&username!='s01'",
"interval": 600,
"callback": "https://yourSMSurl"
},
{
"name": "memory high warning, exclude less than 1GB vps",
"rule": "(memory_used/memory_total)*100>90&memory_total>1048576",
"interval": 300,
"callback": "https://yourSMSurl"
},
{
"name": "offline warning,exclude name node1",
"rule": "online4=0&online6=0&name!='node1'",
"interval": 600,
"callback": "https://yourSMSurl"
},
{
"name": "ddcc attack,limit type Oracle",
"rule": "tcp_count>600&type='Oracle'",
"interval": 300,
"callback": "https://yourSMSurl"
},
{
"name": "month traffic warning",
"rule": "(network_out-last_network_out)/1024/1024/1024>999",
"interval": 3600,
"callback": "https://yourSMSurl"
},
{
"name": "you can parse an expression combining any known field",
"rule": "load_5>3",
"interval": 900,
"callback": "https://yourSMSurl"
}
]
}
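The servers field above holds the client settings. Each client's {$USER} and {$PASSWORD} just need to match the username and password keys of one entry here. name is the name shown on the SS server's web page; type is the server type, such as xen or kvm in the samples above (see this article); host can hold the city where the client server is located (not displayed by default); location is the country; "monthstart": 1 means the traffic counter resets on the 1st of every month (it can also be set to accumulate without resetting).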
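As a rough sketch of how a client entry relates to config.json, the helpers below look up an entry by credentials and append a new one, using the same keys as the sample file. The function names find_server and add_server and the example values are illustrative assumptions; ServerStatus itself does not ship these helpers.

```python
import json


def find_server(config, username, password):
    """Return the servers entry whose username and password both match, or None."""
    for entry in config.get("servers", []):
        if entry.get("username") == username and entry.get("password") == password:
            return entry
    return None


def add_server(config, username, password, name, type_="kvm", host="", location="", monthstart=1):
    """Append a new entry to config['servers'], refusing duplicate usernames."""
    servers = config.setdefault("servers", [])
    if any(e.get("username") == username for e in servers):
        raise ValueError("username already present: " + username)
    entry = {
        "username": username,
        "name": name,
        "type": type_,
        "host": host,
        "location": location,
        "password": password,
        "monthstart": monthstart,
    }
    servers.append(entry)
    return entry


# A cut-down config with one sample entry, for illustration only.
config = json.loads(
    '{"servers": [{"username": "s01", "name": "node1", "type": "xen", "host": "host1",'
    ' "location": "CN", "password": "USER_DEFAULT_PASSWORD", "monthstart": 1}]}'
)
add_server(config, "home01", "change-me", "home-nas", host="living room")
```

After editing, the dict can be written back with json.dump(config, f, indent=4, ensure_ascii=False) so that the flag emoji in location survive as readable characters.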
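The watchdog field below holds alert settings; see github-SS.
Adjust the remaining settings as needed, then restart the SS server after making changes.
On the SS server's web page, the protocol column on the far left indicates whether the server itself has an IPv4 or IPv6 address, and CU, CT and CM on the far right are the packet loss rates towards the China Unicom, China Telecom and China Mobile networks. Each server row can also be expanded for more details.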
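The loss rates in those columns come from the probe threads in the client script, which keep the results of the most recent PING_PACKET_HISTORY_LEN TCP connection attempts and report the fraction that failed once more than 30 results are available. Below is a simplified sketch of that sliding-window calculation; LossWindow is an illustrative class name and not something defined in ServerStatus.

```python
from collections import deque


class LossWindow:
    """Track packet loss over the most recent probe results."""

    def __init__(self, history_len=100, min_samples=30):
        # 1 = probe succeeded, 0 = probe lost; old results drop off automatically.
        self.results = deque(maxlen=history_len)
        self.min_samples = min_samples
        self.rate = 0.0

    def record(self, ok):
        """Store one probe result and return the current loss rate (0.0 to 1.0)."""
        self.results.append(1 if ok else 0)
        # As in the client script, the rate is only updated once
        # more than min_samples results have been collected.
        if len(self.results) > self.min_samples:
            self.rate = self.results.count(0) / float(len(self.results))
        return self.rate

    def percent(self):
        """Loss rate as a percentage, the unit the client reports."""
        return self.rate * 100
```

Waiting for a minimum number of samples keeps a single failed probe right after startup from showing up as a very high loss rate.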