# 7. Monitoring

[Back to table of contents](./README.md)

---

## Architecture

```
Production  (node_exporter:9100) --scraped by--> CI/CD (Prometheus:9090) --> Grafana:3100
Development (node_exporter:9100) --scraped by--> CI/CD (Prometheus:9090) --> Grafana:3100
CI/CD       (node_exporter:9100) --scraped by--> CI/CD (Prometheus:9090) --> Grafana:3100
```

- **Grafana dashboard:** https://monitor.sam.it.kr
- **Prometheus queries:** http://localhost:9090 on the CI/CD server
- **Production server metrics:** http://localhost:9100/metrics on the production server
- **Development server metrics:** http://localhost:9100/metrics on the development server

---

## Prometheus scrape configuration

**Current configuration (`/etc/prometheus/prometheus.yml`):**

```yaml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'sam-prod'
    static_configs:
      - targets: ['211.117.60.189:9100']
        labels:
          server: 'production'

  - job_name: 'sam-cicd'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          server: 'cicd'

  - job_name: 'sam-dev'
    static_configs:
      - targets: ['114.203.209.83:9100']
        labels:
          server: 'development'
```

### Adding a scrape target

```bash
# 1. Install node_exporter on the target server (if not already installed)
#    Binary:  https://github.com/prometheus/node_exporter/releases
#    Service: /etc/systemd/system/node_exporter.service
#    Port:    9100 (default)

# 2. Allow the CI/CD server IP through the target server's firewall
sudo ufw allow from 110.10.147.46 to any port 9100 comment 'Prometheus scraping from CI/CD'

# 3. Edit the configuration file on the CI/CD server
sudo vim /etc/prometheus/prometheus.yml

# 4. Example of a new target entry
#  - job_name: 'sam-new'
#    static_configs:
#      - targets: ['<server IP>:9100']
#        labels:
#          server: '<environment name>'

# 5. Validate the syntax
promtool check config /etc/prometheus/prometheus.yml

# 6. Restart the service to apply the change
sudo systemctl restart prometheus
```
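The entry in step 4 can be generated instead of typed by hand. The following is a minimal sketch that renders a job in the same shape as the existing ones. `render_scrape_job` is a hypothetical helper, and the job name, the address (`192.0.2.10`, a documentation-only IP), and the label are placeholder values, not real servers.

```python
def render_scrape_job(job_name, host, server_label, port=9100):
    """Render one scrape_configs entry, indented to sit under `scrape_configs:`.

    Hypothetical helper: it only builds text and does not touch prometheus.yml.
    """
    return (
        f"  - job_name: '{job_name}'\n"
        f"    static_configs:\n"
        f"      - targets: ['{host}:{port}']\n"
        f"        labels:\n"
        f"          server: '{server_label}'\n"
    )


# Placeholder values for illustration only.
print(render_scrape_job("sam-new", "192.0.2.10", "staging"))
```

The output is meant to be pasted under `scrape_configs:`; running `promtool check config` before the restart is still advisable.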

### Checking target status

```bash
curl -s http://localhost:9090/api/v1/targets | python3 -c "
import json, sys
data = json.load(sys.stdin)
for t in data['data']['activeTargets']:
    print(f\"{t['labels'].get('job','?'):15} {t['health']:6} {t['scrapeUrl']}\")
"
```
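When only failing targets matter, the same API response can be filtered. This is a minimal sketch, assuming the response has the `data.activeTargets[]` shape used above with `health`, `scrapeUrl`, and `lastError` fields. `unhealthy_targets` and the sample payload are illustrative, not real output from this environment.

```python
def unhealthy_targets(payload):
    """Return (job, scrape URL, last error) for every target that is not 'up'."""
    rows = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            rows.append((
                target.get("labels", {}).get("job", "?"),
                target.get("scrapeUrl", "?"),
                target.get("lastError", ""),
            ))
    return rows


# Made-up sample in the shape of GET /api/v1/targets.
SAMPLE = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "sam-cicd"}, "scrapeUrl": "http://localhost:9100/metrics",
         "health": "up", "lastError": ""},
        {"labels": {"job": "sam-new"}, "scrapeUrl": "http://192.0.2.10:9100/metrics",
         "health": "down", "lastError": "context deadline exceeded"},
    ]},
}

for job, url, error in unhealthy_targets(SAMPLE):
    print(f"{job:15} {url} {error}")
```

For live data, the `curl` output can be loaded with `json.load(sys.stdin)` as in the command above and passed to the function.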

---

## PromQL queries

Use these in the Prometheus UI (http://localhost:9090) or in Grafana.

### CPU

```promql
# CPU usage (%), per server
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Idle CPU fraction per core (0 to 1, 5-minute average)
rate(node_cpu_seconds_total{mode="idle"}[5m])
```
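To see what the CPU usage expression computes, the arithmetic can be reproduced by hand. This sketch is a simplification: it takes two readings of each core's idle counter and ignores the extrapolation that `rate()` applies at window boundaries. `cpu_usage_percent` and the sample numbers are illustrative.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate 100 - avg(rate(idle[window])) * 100 for one server.

    idle_start / idle_end: idle-seconds counter per core at the start and end
    of the window, in the same core order.
    """
    idle_rates = [(end - start) / window_seconds
                  for start, end in zip(idle_start, idle_end)]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores over a 5-minute (300 s) window:
# core 0 was idle for 150 s (rate 0.5), core 1 for 300 s (rate 1.0).
usage = cpu_usage_percent([1000.0, 2000.0], [1150.0, 2300.0], 300)
print(f"{usage:.1f}%")  # average idle is 0.75, so usage is 25.0%
```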

### Memory

```promql
# Available memory (%)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

# Memory in use (GiB)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1024 / 1024 / 1024

# Total memory (GiB)
node_memory_MemTotal_bytes / 1024 / 1024 / 1024
```
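The memory alert rule in this document is phrased as memory usage, while the first query returns the available share. The two are complements, as sketched below with made-up byte counts. `memory_used_percent` and `bytes_to_gib` are hypothetical helper names.

```python
GIB = 1024 ** 3  # dividing by 1024 three times yields GiB


def memory_used_percent(mem_total_bytes, mem_available_bytes):
    """Used share (%) = 100 - available share (%)."""
    return 100 - mem_available_bytes / mem_total_bytes * 100


def bytes_to_gib(num_bytes):
    return num_bytes / GIB


total = 16 * GIB      # made-up value for node_memory_MemTotal_bytes
available = 4 * GIB   # made-up value for node_memory_MemAvailable_bytes
print(memory_used_percent(total, available))  # 75.0
print(bytes_to_gib(total - available))        # 12.0
```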

### Disk

```promql
# Disk usage (%)
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Available disk space (GiB)
node_filesystem_avail_bytes{mountpoint="/"} / 1024 / 1024 / 1024

# Disk I/O (read/write bytes per second, 5-minute average)
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
```

### Network

```promql
# Receive (bytes/sec, 5-minute average); change device if the interface is not eth0
rate(node_network_receive_bytes_total{device="eth0"}[5m])

# Transmit (bytes/sec, 5-minute average)
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
```

### System

```promql
# Server uptime (seconds)
time() - node_boot_time_seconds

# Load average (1 minute)
node_load1

# Allocated file descriptors (system-wide)
node_filefd_allocated
```
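Uptime in raw seconds is hard to read at a glance. Here is a small sketch that converts the result of `time() - node_boot_time_seconds` into days, hours, and minutes. `format_uptime` is a hypothetical helper and the timestamps are made up.

```python
def format_uptime(now_ts, boot_ts):
    """Format (now - boot time) in seconds as 'Nd Nh Nm'."""
    seconds = int(now_ts - boot_ts)
    days, remainder = divmod(seconds, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes = remainder // 60
    return f"{days}d {hours}h {minutes}m"


# Made-up timestamps: the server booted 90,061 seconds before "now".
print(format_uptime(1_700_090_061, 1_700_000_000))  # 1d 1h 1m
```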

---

## Grafana dashboards

**Default dashboard:** Node Exporter Full (ID: 1860)

**Data source:** Prometheus (http://localhost:9090)

### Adding a dashboard (Import)

1. In the Grafana web UI, go to Dashboards > Import
2. Enter the dashboard ID (e.g. 1860)
3. Select Prometheus as the data source
4. Click Import

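### Alert rules

**Location:** Grafana > Alerting > Alert rules

**Currently configured alert rules (SAM Alerts folder):**

| Rule | Condition | Pending period | Description |
|------|-----------|----------------|-------------|
| CPU usage > 90% | avg(rate(node_cpu_idle[5m])) | 5 min | CPU overload |
| Memory usage > 85% | MemAvailable/MemTotal | 5 min | Low memory |
| Disk usage > 80% | filesystem_avail/size (/) | 5 min | Low disk space |
| Service down (scrape failure) | up < 1 | 1 min | Prometheus target down |

The Condition column is shorthand for the metrics involved; the full expressions are listed in the PromQL section of this page.

**Notification channels:** Configure email, Slack, and other channels under Grafana > Alerting > Contact points.

**Current setup:** The SAM Slack contact point (Incoming Webhook) is connected. The notification policy routes alerts from the SAM Alerts folder to the Slack `#product_infra` channel.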
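The wait time (pending period) in the rule table means a rule fires only after its condition has held continuously for that long. The following simplified model illustrates the idea. It is not Grafana's implementation, and `alert_state`, the state names, and the sample values are assumptions for illustration.

```python
def alert_state(samples, threshold, pending_seconds):
    """Classify the newest evaluation as 'Normal', 'Pending', or 'Alerting'.

    samples: list of (unix_timestamp, value), oldest first.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # condition just became true
        else:
            breach_start = None           # any dip resets the pending timer
    if breach_start is None:
        return "Normal"
    held_for = samples[-1][0] - breach_start
    return "Alerting" if held_for >= pending_seconds else "Pending"


# CPU usage (%) evaluated once a minute against the 90% / 5-minute rule.
cpu = [(0, 95), (60, 96), (120, 97), (180, 95), (240, 94)]
print(alert_state(cpu, threshold=90, pending_seconds=300))                # Pending
print(alert_state(cpu + [(300, 93)], threshold=90, pending_seconds=300))  # Alerting
```

A short spike that drops back under the threshold within the pending period never reaches the firing state, which is why brief CPU bursts do not page the channel.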

---

## [Production] Performance monitoring

### Memory usage analysis

```bash
free -h
ps aux --sort=-%mem | head -16           # top 15 processes by memory

# MySQL memory
sudo mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
sudo mysql -e "SHOW STATUS LIKE 'Innodb_buffer_pool_bytes_data';"

# Redis memory
redis-cli info memory | grep -E "used_memory_human|maxmemory_human"

# Memory per PHP-FPM process
ps -C php-fpm8.4 -o pid,user,%mem,rss,args --sort=-rss
```

### CPU monitoring

```bash
htop
uptime                                    # load average (1/5/15 minutes)
ps aux --sort=-%cpu | head -11           # top 10 processes by CPU
nproc                                     # number of CPU cores
```

### Disk I/O

```bash
df -h
sudo du -sh /home/webservice/*
sudo du -sh /var/log/*
sudo du -sh /var/lib/mysql/*
sudo iostat -x 1 5                        # extended I/O stats, 5 samples at 1-second intervals
```

### Network

```bash
sudo ss -tlnp                            # listening TCP ports
ss -s                                     # connection state summary
# Top 10 local ports by TCP connection count
sudo ss -tn | awk '{print $4}' | grep -oP ':\d+$' | sort | uniq -c | sort -rn | head -10
```
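The last pipeline counts connections per local port. For readers less familiar with the `awk`/`grep` chain, here is a roughly equivalent sketch that works on captured `ss -tn` output. `connections_per_local_port` is a hypothetical helper, and the sample text uses documentation-range addresses rather than real traffic.

```python
from collections import Counter


def connections_per_local_port(ss_output):
    """Count TCP connections per local port from `ss -tn` text output."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        # Column 4 is "Local Address:Port"; take what follows the last colon.
        port = fields[3].rsplit(":", 1)[-1]
        if port.isdigit():  # skips the header row
            counts[port] += 1
    return counts.most_common(10)


# Made-up sample in the layout printed by `ss -tn`.
SAMPLE = """\
State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
ESTAB  0       0       192.0.2.10:443      198.51.100.7:51234
ESTAB  0       0       192.0.2.10:443      198.51.100.8:40112
ESTAB  0       0       192.0.2.10:22       198.51.100.9:60022
"""

for port, count in connections_per_local_port(SAMPLE):
    print(f"{count:5} :{port}")
```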

### PHP-FPM pool status

```bash
ps aux | grep "php-fpm" | grep -v grep | wc -l          # process count
ps aux | grep "php-fpm" | grep -v grep | awk '{print $NF}' | sort | uniq -c  # count per pool
sudo grep "max_children" /var/log/php8.4-fpm.log | tail -10  # check whether max_children was reached
```

### MySQL performance

```bash
# Connection status
sudo mysql -e "SHOW STATUS LIKE 'Threads%';"

# Slow query summary (top 10 by query time)
sudo mysqldumpslow -s t -t 10 /var/log/mysql/slow.log

# InnoDB buffer pool hit rate
sudo mysql -e "
  SELECT
    ROUND((1 - (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_reads') /
                (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_read_requests')) * 100, 2) AS buffer_pool_hit_rate_pct;
"

# Table lock waits
sudo mysql -e "SHOW STATUS LIKE 'Table_locks%';"
```
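The hit rate query reduces to one formula: the share of read requests served from the buffer pool without going to disk. Here is a sketch with made-up counter values. `buffer_pool_hit_rate` is a hypothetical helper, not part of MySQL.

```python
def buffer_pool_hit_rate(reads, read_requests):
    """Return the hit rate in percent, or None before any read request.

    reads:         Innodb_buffer_pool_reads (reads that had to go to disk)
    read_requests: Innodb_buffer_pool_read_requests (logical read requests)
    """
    if read_requests == 0:
        return None
    return round((1 - reads / read_requests) * 100, 2)


# 1,000 disk reads out of 1,000,000 logical read requests.
print(buffer_pool_hit_rate(1_000, 1_000_000))  # 99.9
```

Both counters are cumulative since server start, so the result is a lifetime average and reacts slowly to recent changes.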

### PM2 monitoring

```bash
pm2 status
pm2 monit                               # real-time CPU/memory
pm2 describe sam-front                   # detailed information
pm2 describe sam-front | grep -A5 "restart"   # restart history
```