Steps / Contents:
1.Background
2.Ward
3.ServerStatus
    (1)Server side
    (2)Client side
    (3)Using SS

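This post was first published on my personal blog at https://lisper517.top/index.php/archives/202/ ; please credit the source when reposting.
The goal of this post is to introduce a few tools for monitoring the state of a Linux machine, similar to Task Manager on Windows.
This post was written on December 30, 2023. It was mainly inspired by the uploader 我不是咕咕鸽, who has also shared a lot of other interesting content. His explanations are detailed and start from the basics, which makes them well suited to beginners.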

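1.Background

In day-to-day server administration, the right tools can greatly reduce the workload. A few days ago I came across many interesting posts on 我不是咕咕鸽's channel and decided to try setting some of them up myself.

Windows has Task Manager, which makes it easy to check CPU and memory usage, network speed, disk status and so on. Linux has similar tools.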

2.Ward

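I previously wrote about Pi Dashboard, a project for monitoring a Raspberry Pi from a web page. Ward is similar to Pi Dashboard and can likewise monitor a Linux machine through a web page.

Following the instructions in github-ward, run the following on the Linux machine: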

mkdir -p /docker/ward
cd /docker/ward
vim docker-compose.yml

with the following content:

version: "3.9"
services:
  ward:
    image: antonyleons/ward
    ports:
      - "4000:4000"
    environment:
      - WARD_PORT=4000
    privileged: true
    restart: unless-stopped

Open port 4000 and start Ward:

ufw allow 4000 comment "ward"
docker-compose config
docker-compose up

Then visit ip:4000 in a browser. It is that simple.
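
If the page does not load, it can help to first check whether the port is reachable at all. Below is a minimal sketch of such a check; the function name port_is_open and the 127.0.0.1 address are placeholders of mine, not part of Ward.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or the name could not be resolved.
        return False


if __name__ == "__main__":
    # 127.0.0.1 is a placeholder; replace it with the address of the Ward host.
    print(port_is_open("127.0.0.1", 4000))
```

A False result only says that the TCP connection failed; the cause could be the firewall rule, the container not running, or a wrong address.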

3.ServerStatus

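Ward is suited to monitoring a single Linux machine. If you have several Linux machines, you can use ServerStatus (abbreviated SS below).

Set up one machine as the SS server and run the client script on each SS client. Each client then reports its status data to the server, and the server's web page shows the status of all client machines.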
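
To make "reports its status data" a little more concrete: the client script listed later in this post sends each report as one line of text, the word update followed by a JSON object and a newline. The sketch below reproduces just that framing. build_update_line mirrors the script's send call, while parse_update_line is a hypothetical inverse I added for illustration and is not part of ServerStatus.

```python
import json


def build_update_line(metrics):
    """Serialize one status report as 'update <json>' plus a newline, UTF-8 encoded."""
    return ("update " + json.dumps(metrics) + "\n").encode("utf-8")


def parse_update_line(line):
    """Hypothetical inverse of build_update_line, for illustration only."""
    text = line.decode("utf-8").strip()
    prefix = "update "
    if not text.startswith(prefix):
        raise ValueError("not an update line")
    return json.loads(text[len(prefix):])


# Made-up sample values; the real script fills in many more fields.
sample = {"uptime": 3600, "cpu": 12.5, "memory_total": 2048000, "memory_used": 512000}
wire = build_update_line(sample)
print(wire)
print(parse_update_line(wire) == sample)
```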

(1)Server side

mkdir -p /docker/server_status/month_traffic
mkdir -p /docker/server_status/conf/server_status
cd /docker/server_status
wget --no-check-certificate -qO /docker/server_status/conf/server_status/config.json https://raw.githubusercontent.com/cppla/ServerStatus/master/server/config.json
vim docker-compose.yml

with the following content:

version: "3.9"
services:
  server_status:
    image: cppla/serverstatus:latest
    container_name: server_status
    ports:
      - "80:80"
      - "35601:35601"
    volumes:
      - /docker/server_status/conf/server_status/config.json:/ServerStatus/server/config.json
      - /docker/server_status/month_traffic:/usr/share/nginx/html/json
    restart: always

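Port 80 serves the web page and port 35601 receives data from the clients; remember to open both (if you use NPM, that is Nginx Proxy Manager, comment out the 80 mapping and configure a reverse proxy instead). Finally, run: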

docker-compose config
docker-compose up

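The page is now reachable on port 80. config.json ships with a few sample servers, which all currently show as offline.

(2)Client side

The client does not need a public IP, so a server on a home network works too. First confirm that Python is installed on the client: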

python3 -V

Most cloud servers have it preinstalled; if yours does not, see runoob.
The SS client only takes a few commands:

mkdir -p /docker/SS_client
wget --no-check-certificate -qO /docker/SS_client/client-linux.py 'https://raw.githubusercontent.com/cppla/ServerStatus/master/clients/client-linux.py'
chmod +x /docker/SS_client/client-linux.py

The script reads as follows:

#!/usr/bin/env python3
# coding: utf-8
# Update by : https://github.com/cppla/ServerStatus, Update date: 20220530
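# Version: 1.0.3, supported Python versions: 2.7 to 3.10
# Supported operating systems: Linux, OSX, FreeBSD, OpenBSD and NetBSD, both 32-bit and 64-bit architectures
# Note: by default you only need to change SERVER and USER. The packet-loss probe targets can be customized, e.g. CU = "www.facebook.com".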

SERVER = "127.0.0.1"
USER = "s01"


PASSWORD = "USER_DEFAULT_PASSWORD"
PORT = 35601
CU = "cu.tz.cloudcpp.com"
CT = "ct.tz.cloudcpp.com"
CM = "cm.tz.cloudcpp.com"
PROBEPORT = 80
PROBE_PROTOCOL_PREFER = "ipv4"  # ipv4, ipv6
PING_PACKET_HISTORY_LEN = 100
INTERVAL = 1

import socket
import time
import timeit
import re
import os
import sys
import json
import errno
import subprocess
import threading
try:
    from queue import Queue     # python3
except ImportError:
    from Queue import Queue     # python2

def get_uptime():
    with open('/proc/uptime', 'r') as f:
        uptime = f.readline().split('.', 2)
        return int(uptime[0])

def get_memory():
    re_parser = re.compile(r'^(?P<key>\S*):\s*(?P<value>\d*)\s*kB')
    result = dict()
    for line in open('/proc/meminfo'):
        match = re_parser.match(line)
        if not match:
            continue
        key, value = match.groups(['key', 'value'])
        result[key] = int(value)
    MemTotal = float(result['MemTotal'])
    MemUsed = MemTotal-float(result['MemFree'])-float(result['Buffers'])-float(result['Cached'])-float(result['SReclaimable'])
    SwapTotal = float(result['SwapTotal'])
    SwapFree = float(result['SwapFree'])
    return int(MemTotal), int(MemUsed), int(SwapTotal), int(SwapFree)

def get_hdd():
    p = subprocess.check_output(['df', '-Tlm', '--total', '-t', 'ext4', '-t', 'ext3', '-t', 'ext2', '-t', 'reiserfs', '-t', 'jfs', '-t', 'ntfs', '-t', 'fat32', '-t', 'btrfs', '-t', 'fuseblk', '-t', 'zfs', '-t', 'simfs', '-t', 'xfs']).decode("Utf-8")
    total = p.splitlines()[-1]
    used = total.split()[3]
    size = total.split()[2]
    return int(size), int(used)

def get_time():
    with open("/proc/stat", "r") as f:
        time_list = f.readline().split(' ')[2:6]
        for i in range(len(time_list))  :
            time_list[i] = int(time_list[i])
        return time_list

def delta_time():
    x = get_time()
    time.sleep(INTERVAL)
    y = get_time()
    for i in range(len(x)):
        y[i]-=x[i]
    return y

def get_cpu():
    t = delta_time()
    st = sum(t)
    if st == 0:
        st = 1
    result = 100-(t[len(t)-1]*100.00/st)
    return round(result, 1)

def liuliang():
    NET_IN = 0
    NET_OUT = 0
    with open('/proc/net/dev') as f:
        for line in f.readlines():
            netinfo = re.findall('([^\s]+):[\s]{0,}(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)', line)
            if netinfo:
                if netinfo[0][0] == 'lo' or 'tun' in netinfo[0][0] \
                        or 'docker' in netinfo[0][0] or 'veth' in netinfo[0][0] \
                        or 'br-' in netinfo[0][0] or 'vmbr' in netinfo[0][0] \
                        or 'vnet' in netinfo[0][0] or 'kube' in netinfo[0][0] \
                        or netinfo[0][1]=='0' or netinfo[0][9]=='0':
                    continue
                else:
                    NET_IN += int(netinfo[0][1])
                    NET_OUT += int(netinfo[0][9])
    return NET_IN, NET_OUT

def tupd():
    '''
    tcp, udp, process, thread count: for view ddcc attack , then send warning
    :return:
    '''
    s = subprocess.check_output("ss -t|wc -l", shell=True)
    t = int(s[:-1])-1
    s = subprocess.check_output("ss -u|wc -l", shell=True)
    u = int(s[:-1])-1
    s = subprocess.check_output("ps -ef|wc -l", shell=True)
    p = int(s[:-1])-2
    s = subprocess.check_output("ps -eLf|wc -l", shell=True)
    d = int(s[:-1])-2
    return t,u,p,d

def get_network(ip_version):
    if(ip_version == 4):
        HOST = "ipv4.google.com"
    elif(ip_version == 6):
        HOST = "ipv6.google.com"
    try:
        socket.create_connection((HOST, 80), 2).close()
        return True
    except:
        return False

lostRate = {
    '10010': 0.0,
    '189': 0.0,
    '10086': 0.0
}
pingTime = {
    '10010': 0,
    '189': 0,
    '10086': 0
}
netSpeed = {
    'netrx': 0.0,
    'nettx': 0.0,
    'clock': 0.0,
    'diff': 0.0,
    'avgrx': 0,
    'avgtx': 0
}
diskIO = {
    'read': 0,
    'write': 0
}

def _ping_thread(host, mark, port):
    lostPacket = 0
    packet_queue = Queue(maxsize=PING_PACKET_HISTORY_LEN)

    while True:
        # flush dns , every time.
        IP = host
        if host.count(':') < 1:  # if not plain ipv6 address, means ipv4 address or hostname
            try:
                if PROBE_PROTOCOL_PREFER == 'ipv4':
                    IP = socket.getaddrinfo(host, None, socket.AF_INET)[0][4][0]
                else:
                    IP = socket.getaddrinfo(host, None, socket.AF_INET6)[0][4][0]
            except Exception:
                pass

        if packet_queue.full():
            if packet_queue.get() == 0:
                lostPacket -= 1
        try:
            b = timeit.default_timer()
            socket.create_connection((IP, port), timeout=1).close()
            pingTime[mark] = int((timeit.default_timer() - b) * 1000)
            packet_queue.put(1)
        except socket.error as error:
            if error.errno == errno.ECONNREFUSED:
                pingTime[mark] = int((timeit.default_timer() - b) * 1000)
                packet_queue.put(1)
            #elif error.errno == errno.ETIMEDOUT:
            else:
                lostPacket += 1
                packet_queue.put(0)

        if packet_queue.qsize() > 30:
            lostRate[mark] = float(lostPacket) / packet_queue.qsize()

        time.sleep(INTERVAL)

def _net_speed():
    while True:
        with open("/proc/net/dev", "r") as f:
            net_dev = f.readlines()
            avgrx = 0
            avgtx = 0
            for dev in net_dev[2:]:
                dev = dev.split(':')
                if "lo" in dev[0] or "tun" in dev[0] \
                        or "docker" in dev[0] or "veth" in dev[0] \
                        or "br-" in dev[0] or "vmbr" in dev[0] \
                        or "vnet" in dev[0] or "kube" in dev[0]:
                    continue
                dev = dev[1].split()
                avgrx += int(dev[0])
                avgtx += int(dev[8])
            now_clock = time.time()
            netSpeed["diff"] = now_clock - netSpeed["clock"]
            netSpeed["clock"] = now_clock
            netSpeed["netrx"] = int((avgrx - netSpeed["avgrx"]) / netSpeed["diff"])
            netSpeed["nettx"] = int((avgtx - netSpeed["avgtx"]) / netSpeed["diff"])
            netSpeed["avgrx"] = avgrx
            netSpeed["avgtx"] = avgtx
        time.sleep(INTERVAL)

def _disk_io():
    '''
    good luck for opensource! by: cpp.la
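    Disk IO: because of IOPS, SSDs, HDDs, RAID cards, ZFS and other array technologies all behave differently, so the impact of IO on performance has to be judged against your own server.
    For example, on a mechanical hard disk like mine, heavy random reads and writes of small files cause long disk waits even at very low throughput.
    With sequential IO, an ordinary mechanical hard disk writing at 100Mb/s can also cause long disk waits.
    Disk read/write figures are approximate: 4k, 8k, https://stackoverflow.com/questions/34413926/psutil-vs-dd-monitoring-disk-i-o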
    :return:
    '''
    while True:
        # pre pid snapshot
        snapshot_first = {}
        # next pid snapshot
        snapshot_second = {}
        # read count snapshot
        snapshot_read = 0
        # write count snapshot
        snapshot_write = 0
        # process snapshot
        pid_snapshot = [str(i) for i in os.listdir("/proc") if i.isdigit() is True]
        for pid in pid_snapshot:
            try:
                with open("/proc/{}/io".format(pid)) as f:
                    pid_io = {}
                    for line in f.readlines():
                        if "read_bytes" in line:
                            pid_io["read"] = int(line.split("read_bytes:")[-1].strip())
                        elif "write_bytes" in line and "cancelled_write_bytes" not in line:
                            pid_io["write"] = int(line.split("write_bytes:")[-1].strip())
                    pid_io["name"] = open("/proc/{}/comm".format(pid), "r").read().strip()
                    snapshot_first[pid] = pid_io
            except:
                if pid in snapshot_first:
                    snapshot_first.pop(pid)

        time.sleep(INTERVAL)

        for pid in pid_snapshot:
            try:
                with open("/proc/{}/io".format(pid)) as f:
                    pid_io = {}
                    for line in f.readlines():
                        if "read_bytes" in line:
                            pid_io["read"] = int(line.split("read_bytes:")[-1].strip())
                        elif "write_bytes" in line and "cancelled_write_bytes" not in line:
                            pid_io["write"] = int(line.split("write_bytes:")[-1].strip())
                    pid_io["name"] = open("/proc/{}/comm".format(pid), "r").read().strip()
                    snapshot_second[pid] = pid_io
            except:
                if pid in snapshot_first:
                    snapshot_first.pop(pid)
                if pid in snapshot_second:
                    snapshot_second.pop(pid)

        for k, v in snapshot_first.items():
            if snapshot_first[k]["name"] == snapshot_second[k]["name"] and snapshot_first[k]["name"] != "bash":
                snapshot_read += (snapshot_second[k]["read"] - snapshot_first[k]["read"])
                snapshot_write += (snapshot_second[k]["write"] - snapshot_first[k]["write"])
        diskIO["read"] = snapshot_read
        diskIO["write"] = snapshot_write

def get_realtime_data():
    '''
    real time get system data
    :return:
    '''
    t1 = threading.Thread(
        target=_ping_thread,
        kwargs={
            'host': CU,
            'mark': '10010',
            'port': PROBEPORT
        }
    )
    t2 = threading.Thread(
        target=_ping_thread,
        kwargs={
            'host': CT,
            'mark': '189',
            'port': PROBEPORT
        }
    )
    t3 = threading.Thread(
        target=_ping_thread,
        kwargs={
            'host': CM,
            'mark': '10086',
            'port': PROBEPORT
        }
    )
    t4 = threading.Thread(
        target=_net_speed,
    )
    t5 = threading.Thread(
        target=_disk_io,
    )
    for ti in [t1, t2, t3, t4, t5]:
        ti.daemon = True
        ti.start()

def byte_str(object):
    '''
    bytes to str, str to bytes
    :param object:
    :return:
    '''
    if isinstance(object, str):
        return object.encode(encoding="utf-8")
    elif isinstance(object, bytes):
        return bytes.decode(object)
    else:
        print(type(object))

if __name__ == '__main__':
    for argc in sys.argv:
        if 'SERVER' in argc:
            SERVER = argc.split('SERVER=')[-1]
        elif 'PORT' in argc:
            PORT = int(argc.split('PORT=')[-1])
        elif 'USER' in argc:
            USER = argc.split('USER=')[-1]
        elif 'PASSWORD' in argc:
            PASSWORD = argc.split('PASSWORD=')[-1]
        elif 'INTERVAL' in argc:
            INTERVAL = int(argc.split('INTERVAL=')[-1])
    socket.setdefaulttimeout(30)
    get_realtime_data()
    while True:
        try:
            print("Connecting...")
            s = socket.create_connection((SERVER, PORT))
            data = byte_str(s.recv(1024))
            if data.find("Authentication required") > -1:
                s.send(byte_str(USER + ':' + PASSWORD + '\n'))
                data = byte_str(s.recv(1024))
                if data.find("Authentication successful") < 0:
                    print(data)
                    raise socket.error
            else:
                print(data)
                raise socket.error

            print(data)
            if data.find("You are connecting via") < 0:
                data = byte_str(s.recv(1024))
                print(data)

            timer = 0
            check_ip = 0
            if data.find("IPv4") > -1:
                check_ip = 6
            elif data.find("IPv6") > -1:
                check_ip = 4
            else:
                print(data)
                raise socket.error

            while True:
                CPU = get_cpu()
                NET_IN, NET_OUT = liuliang()
                Uptime = get_uptime()
                Load_1, Load_5, Load_15 = os.getloadavg()
                MemoryTotal, MemoryUsed, SwapTotal, SwapFree = get_memory()
                HDDTotal, HDDUsed = get_hdd()

                array = {}
                if not timer:
                    array['online' + str(check_ip)] = get_network(check_ip)
                    timer = 10
                else:
                    timer -= 1*INTERVAL

                array['uptime'] = Uptime
                array['load_1'] = Load_1
                array['load_5'] = Load_5
                array['load_15'] = Load_15
                array['memory_total'] = MemoryTotal
                array['memory_used'] = MemoryUsed
                array['swap_total'] = SwapTotal
                array['swap_used'] = SwapTotal - SwapFree
                array['hdd_total'] = HDDTotal
                array['hdd_used'] = HDDUsed
                array['cpu'] = CPU
                array['network_rx'] = netSpeed.get("netrx")
                array['network_tx'] = netSpeed.get("nettx")
                array['network_in'] = NET_IN
                array['network_out'] = NET_OUT
                # todo: kept for compatibility with old versions; ip_status will be removed in the next version
                array['ip_status'] = True
                array['ping_10010'] = lostRate.get('10010') * 100
                array['ping_189'] = lostRate.get('189') * 100
                array['ping_10086'] = lostRate.get('10086') * 100
                array['time_10010'] = pingTime.get('10010')
                array['time_189'] = pingTime.get('189')
                array['time_10086'] = pingTime.get('10086')
                array['tcp'], array['udp'], array['process'], array['thread'] = tupd()
                array['io_read'] = diskIO.get("read")
                array['io_write'] = diskIO.get("write")

                s.send(byte_str("update " + json.dumps(array) + "\n"))
        except KeyboardInterrupt:
            raise
        except socket.error:
            print("Disconnected...")
            if 's' in locals().keys():
                del s
            time.sleep(3)
        except Exception as e:
            print("Caught Exception:", e)
            if 's' in locals().keys():
                del s
            time.sleep(3)
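
The CPU figure is one of the less obvious parts of the script above: get_time reads the user, nice, system and idle counters from the first line of /proc/stat, delta_time subtracts two samples taken INTERVAL seconds apart, and get_cpu reports the non-idle share. Below is a small sketch of the same arithmetic on made-up counters; cpu_percent is a name I chose for illustration, and the numbers are not real measurements.

```python
def cpu_percent(before, after):
    """CPU usage in percent from two (user, nice, system, idle) counter samples."""
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta) or 1  # avoid division by zero, as the script does
    idle = delta[-1]         # idle is the last of the four counters
    return round(100 - idle * 100.0 / total, 1)


# Made-up counter values, taken one interval apart.
first = [1000, 0, 500, 8500]
second = [1030, 0, 520, 8550]
print(cpu_percent(first, second))
```

Here half of the elapsed ticks were idle, so the function reports 50.0. Only these four counters are used, so time spent in iowait, irq and similar states is not part of the total.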

Finally, run the SS client script in the background (although it is better to hold off for now):

nohup python3 /docker/SS_client/client-linux.py SERVER={$SERVER} USER={$USER} PASSWORD={$PASSWORD} >/dev/null 2>&1 &

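Here {$SERVER} is the address of the SS server, that is, the machine the client connects to (the script passes it to socket.create_connection). {$USER} and {$PASSWORD} must match an entry in the server's config.json, as described below.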
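
The script picks these KEY=VALUE arguments out of sys.argv and uses them in place of the defaults at the top of the file. The sketch below shows the same idea in a self-contained form. It is my own variant, not the script's code: the script uses substring checks such as 'SERVER' in argc, whereas parse_overrides (a hypothetical name) matches the key exactly. The address 203.0.113.10 is a documentation placeholder.

```python
def parse_overrides(argv, defaults):
    """Apply KEY=VALUE arguments on top of a dict of default settings."""
    settings = dict(defaults)
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep or key not in settings:
            continue  # skip the script name and unknown arguments
        # Keep the type of the default (PORT and INTERVAL are integers).
        settings[key] = type(settings[key])(value)
    return settings


defaults = {"SERVER": "127.0.0.1", "PORT": 35601, "USER": "s01",
            "PASSWORD": "USER_DEFAULT_PASSWORD", "INTERVAL": 1}
argv = ["client-linux.py", "SERVER=203.0.113.10", "USER=s02", "PASSWORD=secret"]
print(parse_overrides(argv, defaults))
```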

Rather than starting it by hand, it is better to have the script start at boot; see my earlier post.

vim /etc/systemd/system/ss_client.service

with the following content:

[Unit]
Description=ServerStatus-client
Wants=network-online.target
After=network.target network-online.target
Requires=network-online.target

[Service]
ExecStart=/usr/bin/python3 /docker/SS_client/client-linux.py
ExecStop=/bin/kill $MAINPID
Restart=on-failure
RestartSec=5
StartLimitInterval=0

[Install]
WantedBy=multi-user.target

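Before starting it, edit /docker/SS_client/client-linux.py and fill in your own values for SERVER, USER and PASSWORD (the ExecStart line above passes no arguments, so the values inside the script are the ones used).
Finally, start the service: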

systemctl enable ss_client.service
systemctl start ss_client.service
systemctl status ss_client.service

(3)Using SS

The initial /docker/server_status/conf/server_status/config.json looks like this:

{
    "servers": [
        {
            "username": "s01",
            "name": "node1",
            "type": "xen",
            "host": "host1",
            "location": "🇨🇳",
            "password": "USER_DEFAULT_PASSWORD",
            "monthstart": 1
        },
        {
            "username": "s02",
            "name": "node2",
            "type": "vmware",
            "host": "host2",
            "location": "🇯🇵",
            "password": "USER_DEFAULT_PASSWORD",
            "monthstart": 1
        },
        {
            "disabled": true,
            "username": "s03",
            "name": "node3",
            "type": "hyper",
            "host": "host3",
            "location": "🇫🇷",
            "password": "USER_DEFAULT_PASSWORD",
            "monthstart": 1
        },
        {
            "username": "s04",
            "name": "node4",
            "type": "kvm",
            "host": "host4",
            "location": "🇰🇷",
            "password": "USER_DEFAULT_PASSWORD",
            "monthstart": 1
        }
    ],
    "watchdog": [
        {
            "name": "cpu high warning,exclude username s01",
            "rule": "cpu>90&load_1>5&username!='s01'",
            "interval": 600,
            "callback": "https://yourSMSurl"
        },
        {
            "name": "memory high warning, exclude less than 1GB vps",
            "rule": "(memory_used/memory_total)*100>90&memory_total>1048576",
            "interval": 300,
            "callback": "https://yourSMSurl"
        },
        {
            "name": "offline warning,exclude name node1",
            "rule": "online4=0&online6=0&name!='node1'",
            "interval": 600,
            "callback": "https://yourSMSurl"
        },
        {
            "name": "ddcc attack,limit type Oracle",
            "rule": "tcp_count>600&type='Oracle'",
            "interval": 300,
            "callback": "https://yourSMSurl"
        },
        {
            "name": "month traffic warning",
            "rule": "(network_out-last_network_out)/1024/1024/1024>999",
            "interval": 3600,
            "callback": "https://yourSMSurl"
        },
        {
            "name": "you can parse an expression combining any known field",
            "rule": "load_5>3",
            "interval": 900,
            "callback": "https://yourSMSurl"
        }
    ]
}

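The servers field above configures the clients: a client's {$USER} and {$PASSWORD} must match the username and password keys of an entry here. name is the label shown on the SS web page, type is the kind of server (the sample uses values such as xen and kvm; see this article), host can hold the city where the client is located (it is not displayed by default), location is the country, and "monthstart": 1 means the traffic counter resets on the 1st of each month (it can also be set to accumulate without resetting).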
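
Before starting a client, it can save some debugging to check that its credentials really appear in config.json. Below is a minimal sketch of such a check. find_server is a hypothetical helper of mine, and skipping entries marked "disabled" is my assumption about what that key means, not something confirmed in this post.

```python
import json


def find_server(config_text, username, password):
    """Return the matching entry from the "servers" list, or None."""
    config = json.loads(config_text)
    for entry in config.get("servers", []):
        if entry.get("disabled"):
            continue  # assumption: disabled entries should not be used
        if entry.get("username") == username and entry.get("password") == password:
            return entry
    return None


# A shortened copy of the sample config, for illustration.
sample_config = """
{"servers": [
  {"username": "s01", "name": "node1", "password": "USER_DEFAULT_PASSWORD"},
  {"disabled": true, "username": "s03", "name": "node3", "password": "USER_DEFAULT_PASSWORD"}
]}
"""
print(find_server(sample_config, "s01", "USER_DEFAULT_PASSWORD")["name"])
```

On a real server you would pass in the text of /docker/server_status/conf/server_status/config.json instead of the inline sample.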

The watchdog field below holds alert settings; see github-SS.
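
As a worked example of one rule, take the sample "memory high warning" entry: (memory_used/memory_total)*100>90&memory_total>1048576. The client reads its memory figures from /proc/meminfo in kB, so 1048576 corresponds to 1 GiB, which matches the rule's description of excluding small VPSes. The sketch below only redoes that arithmetic in Python 3. It assumes & means logical AND, and it says nothing about how the SS server actually evaluates rules; memory_rule_fires is a name I made up.

```python
def memory_rule_fires(memory_used_kb, memory_total_kb):
    """Mirror of the sample rule: usage above 90% on hosts with more than 1 GiB."""
    usage_percent = memory_used_kb / memory_total_kb * 100
    return usage_percent > 90 and memory_total_kb > 1048576


# Made-up values in kB.
print(memory_rule_fires(3900000, 4000000))  # high usage on a host above 1 GiB
print(memory_rule_fires(500000, 512000))    # high usage, but the host is below 1 GiB
print(memory_rule_fires(2000000, 4000000))  # usage at 50%
```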

Adjust the rest as you see fit, then restart the SS server after making changes.

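On the SS web page, the leftmost protocol column shows whether the server itself has IPv4 and IPv6 addresses, and the rightmost CU, CT and CM columns show the packet loss rate to China Unicom, China Telecom and China Mobile. Each server row can also be expanded for more detail.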
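
The packet loss figures come from the client script's _ping_thread, which keeps the results of the most recent probes (PING_PACKET_HISTORY_LEN, 100 by default) and only updates the rate once more than 30 results are available. The sketch below shows that sliding-window idea with collections.deque instead of the script's Queue and counter; LossWindow is a hypothetical class of mine, written for Python 3.

```python
from collections import deque


class LossWindow:
    """Track the loss rate over the most recent probes."""

    def __init__(self, history_len=100, min_samples=30):
        self.samples = deque(maxlen=history_len)  # 1 = reply received, 0 = lost
        self.min_samples = min_samples
        self.rate = 0.0

    def record(self, ok):
        self.samples.append(1 if ok else 0)
        # Like the script, update the rate only once more than min_samples results exist.
        if len(self.samples) > self.min_samples:
            self.rate = self.samples.count(0) / len(self.samples)
        return self.rate


window = LossWindow()
for i in range(40):
    window.record(ok=(i % 10 != 0))  # simulate every 10th probe being lost
print(window.rate * 100)
```

Because the window is bounded, an old burst of losses drops out of the rate after history_len further probes.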

Tags: docker, ward, serverstatus
