권혁성 db63fcff85 refactor: [docs] Reorganize folder structure by team (shared/dev/frontend/planning)

- Create dev/ folder for the development team (moved standards, guides, quickstart, changes, deploys, data, history, dev_plans)
- Create frontend/ folder for the frontend team (api/ → frontend/api-specs/)
- Create requests/ folder for the planning team
- Rename plans/ → dev/dev_plans/
- Add README.md (guide for humans), rewrite INDEX.md (for Claude Code)
- Add resources.md (for Notion links; assets/brochure to be migrated)
- Delete CURRENT_WORKS.md, move TODO.md → dev/
- Update all reference paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-05 16:46:03 +09:00

7. Monitoring

Architecture

Production server  (node_exporter:9100) --scrape--> CI/CD (Prometheus:9090) --> Grafana:3100
Development server (node_exporter:9100) --scrape--> CI/CD (Prometheus:9090) --> Grafana:3100
CI/CD server       (node_exporter:9100) --scrape--> CI/CD (Prometheus:9090) --> Grafana:3100

Grafana dashboard: https://monitor.sam.it.kr
Prometheus queries: http://localhost:9090 on the CI/CD server
Production server metrics: http://localhost:9100/metrics on the production server
Development server metrics: http://localhost:9100/metrics on the development server

Prometheus scrape configuration

Current configuration (/etc/prometheus/prometheus.yml):

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'sam-prod'
    static_configs:
      - targets: ['211.117.60.189:9100']
        labels:
          server: 'production'

  - job_name: 'sam-cicd'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          server: 'cicd'

  - job_name: 'sam-dev'
    static_configs:
      - targets: ['114.203.209.83:9100']
        labels:
          server: 'development'

Adding a scrape target

# 1. Install node_exporter on the target server (if not already installed)
#    Binary: https://github.com/prometheus/node_exporter/releases
#    Service: /etc/systemd/system/node_exporter.service
#    Port: 9100 (default)

# 2. Allow the CI/CD server IP through the target server's firewall
sudo ufw allow from 110.10.147.46 to any port 9100 comment 'Prometheus scraping from CI/CD'

# 3. Edit the configuration file on the CI/CD server
sudo vim /etc/prometheus/prometheus.yml

# 4. Example of a new target
#  - job_name: 'sam-new'
#    static_configs:
#      - targets: ['<server IP>:9100']
#        labels:
#          server: '<environment name>'

# 5. Validate the syntax
promtool check config /etc/prometheus/prometheus.yml

# 6. Restart the service to apply the change
sudo systemctl restart prometheus
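
A new target entry has the same shape as the existing jobs in prometheus.yml. As a minimal sketch, the hypothetical helper `render_scrape_job` below (not part of Prometheus) generates the text to paste under `scrape_configs`. The job name, address, and label value are placeholders, not real servers.

```python
def render_scrape_job(job_name: str, host: str, server_label: str, port: int = 9100) -> str:
    """Return one scrape_configs entry as YAML text, indented like the existing jobs."""
    lines = [
        f"  - job_name: '{job_name}'",
        "    static_configs:",
        f"      - targets: ['{host}:{port}']",
        "        labels:",
        f"          server: '{server_label}'",
    ]
    return "\n".join(lines)


# Placeholder values: 192.0.2.10 is a documentation-only address.
print(render_scrape_job("sam-new", "192.0.2.10", "staging"))
```

The generated text still needs to pass `promtool check config` before Prometheus is restarted.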

Checking target status

curl -s http://localhost:9090/api/v1/targets | python3 -c "
import json, sys
data = json.load(sys.stdin)
for t in data['data']['activeTargets']:
    print(f\"{t['labels'].get('job','?'):15} {t['health']:6} {t['scrapeUrl']}\")
"

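To list only the targets that need attention, the same /api/v1/targets response can be filtered by the `health` field. The sketch below is an assumption-laden example: `unhealthy_targets` is a hypothetical helper, and the sample payload is hand-written in the shape of the API response rather than captured from a real server.

```python
import json


def unhealthy_targets(payload: dict) -> list[tuple[str, str, str]]:
    """Return (job, health, scrapeUrl) for every active target whose health is not 'up'."""
    rows = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            job = target["labels"].get("job", "?")
            rows.append((job, target["health"], target["scrapeUrl"]))
    return rows


# Illustrative sample only; real responses carry many more fields per target.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "sam-cicd"}, "health": "up",
       "scrapeUrl": "http://localhost:9100/metrics"},
      {"labels": {"job": "sam-new"}, "health": "down",
       "scrapeUrl": "http://192.0.2.10:9100/metrics"}
    ]
  }
}
""")

for job, health, url in unhealthy_targets(sample):
    print(f"{job:15} {health:6} {url}")
```

In practice the payload would come from `curl -s http://localhost:9090/api/v1/targets` piped into the script through `json.load(sys.stdin)`.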
PromQL queries

Use these in the Prometheus UI (http://localhost:9090) or in Grafana.

CPU

# CPU usage (%) per server
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Idle CPU fraction per core (5-minute average, 0 to 1)
rate(node_cpu_seconds_total{mode="idle"}[5m])
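
The CPU usage query works by turning the per-core idle counter into a per-second idle fraction, averaging over cores, and subtracting from 100. The sketch below reproduces that arithmetic from two counter samples. It is a simplification: Prometheus `rate()` also handles counter resets and extrapolation, which are ignored here, and the function name and sample values are made up for illustration.

```python
def cpu_usage_percent(idle_start: list[float], idle_end: list[float], elapsed_seconds: float) -> float:
    """Approximate 100 - avg(rate(idle)) * 100 from two samples of per-core idle counters."""
    if not idle_start or len(idle_start) != len(idle_end):
        raise ValueError("need the same non-empty set of cores in both samples")
    # Per-core idle fraction: idle seconds gained divided by wall-clock seconds.
    idle_rates = [(end - start) / elapsed_seconds for start, end in zip(idle_start, idle_end)]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores over a 300-second window: idle for 240 s and 270 s respectively.
usage = cpu_usage_percent([1000.0, 2000.0], [1240.0, 2270.0], 300.0)
print(round(usage, 2))  # 15.0
```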

Memory

# Available memory ratio (%)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

# Memory in use (GiB)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 1024 / 1024 / 1024

# Total memory (GiB)
node_memory_MemTotal_bytes / 1024 / 1024 / 1024
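
The first query returns the available share of memory, while alert thresholds are usually stated as usage, so the two are complements. A small worked example with made-up values (an 8 GiB host with 2 GiB available) and hypothetical helper names:

```python
def memory_used_percent(mem_available_bytes: int, mem_total_bytes: int) -> float:
    """Usage percentage as the complement of the available-memory ratio."""
    available_pct = mem_available_bytes / mem_total_bytes * 100
    return 100 - available_pct


def bytes_to_gib(value: float) -> float:
    """Same conversion as the queries: divide by 1024 three times."""
    return value / 1024 / 1024 / 1024


total = 8 * 1024**3      # illustrative value, not a measurement
available = 2 * 1024**3  # illustrative value, not a measurement

print(memory_used_percent(available, total))  # 75.0
print(bytes_to_gib(total - available))        # 6.0
```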

Disk

# Disk usage (%)
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)

# Available disk space (GiB)
node_filesystem_avail_bytes{mountpoint="/"} / 1024 / 1024 / 1024

# Disk I/O (read/write bytes per second, 5-minute average)
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
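
The disk usage formula can be cross-checked locally without Prometheus. The sketch below applies the same arithmetic to the values from Python's `shutil.disk_usage`; on Linux its `free` field is the space available to unprivileged users, which is assumed here to correspond to `node_filesystem_avail_bytes`. The helper name is hypothetical.

```python
import shutil


def disk_usage_percent(avail_bytes: int, size_bytes: int) -> float:
    """Same arithmetic as the PromQL query: 100 - avail / size * 100."""
    return 100 - avail_bytes / size_bytes * 100


root = shutil.disk_usage("/")  # named tuple with total, used, free (bytes)
local_pct = disk_usage_percent(root.free, root.total)
print(f"/ usage: {local_pct:.1f}%  available: {root.free / 1024**3:.1f} GiB")
```

Small differences from the Grafana value are expected, since the two readings are taken at different moments.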

Network

# Receive (bytes/sec, 5-minute average)
rate(node_network_receive_bytes_total{device="eth0"}[5m])

# Transmit (bytes/sec, 5-minute average)
rate(node_network_transmit_bytes_total{device="eth0"}[5m])

System

# Server uptime (seconds)
time() - node_boot_time_seconds

# Load average (1 minute)
node_load1

# Allocated file descriptors
node_filefd_allocated
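
The uptime query returns raw seconds, which is hard to read at a glance. A minimal sketch of converting that value into days, hours, and minutes follows; `format_uptime` is a hypothetical helper and the input is an example number, not a measurement.

```python
def format_uptime(seconds: float) -> str:
    """Format an uptime in seconds as 'Nd Nh Nm' (leftover seconds are dropped)."""
    total = int(seconds)
    days, remainder = divmod(total, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, _ = divmod(remainder, 60)
    return f"{days}d {hours}h {minutes}m"


print(format_uptime(93784))  # 1d 2h 3m
```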

Grafana dashboards

Default dashboard: Node Exporter Full (ID: 1860)

Data source: Prometheus (http://localhost:9090)

Importing a dashboard

1. In the Grafana web UI, go to Dashboards > Import
2. Enter the dashboard ID (e.g. 1860)
3. Select Prometheus as the data source
4. Click Import

Alert rule configuration

Location: Grafana > Alerting > Alert rules

Currently configured alert rules (SAM Alerts folder):

Rule	Condition	Wait time	Description
CPU usage > 90%	avg(rate(node_cpu_idle[5m]))	5 min	CPU overload
Memory usage > 85%	MemAvailable/MemTotal	5 min	Low memory
Disk usage > 80%	filesystem_avail/size (/)	5 min	Low disk space
Service down (scrape failure)	up < 1	1 min	Prometheus target down
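
The wait time in each rule means the condition has to hold continuously for that long before the alert fires, so a short spike does not notify anyone. The sketch below is a simplified model of that behaviour, not Grafana's implementation; `should_fire` and the sample series are hypothetical.

```python
def should_fire(samples: list[tuple[float, float]], threshold: float, wait_seconds: float) -> bool:
    """Return True if value > threshold has held without a break for at least
    wait_seconds, measured up to the latest sample.

    samples: (timestamp_seconds, value) pairs in ascending time order.
    """
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp  # start of the current breach
        else:
            breach_started = None  # condition cleared, so the wait restarts
    if breach_started is None:
        return False
    return samples[-1][0] - breach_started >= wait_seconds


# CPU usage sampled every 60 s; threshold 90 %, wait time 5 min (300 s).
spike = [(0, 95.0), (60, 96.0), (120, 40.0), (180, 93.0), (240, 94.0)]
sustained = [(0, 92.0), (60, 95.0), (120, 97.0), (180, 93.0), (240, 94.0), (300, 91.0)]

print(should_fire(spike, 90.0, 300))      # False
print(should_fire(sustained, 90.0, 300))  # True
```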

Notification channels: configure email, Slack, and so on under Grafana > Alerting > Contact points.

Current setup: the SAM Slack contact point (Incoming Webhook) is connected. The notification policy routes alerts from the SAM Alerts folder to the Slack #product_infra channel.

[Production] Performance monitoring

Memory usage analysis

free -h
ps aux --sort=-%mem | head -16

# MySQL memory
sudo mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
sudo mysql -e "SHOW STATUS LIKE 'Innodb_buffer_pool_bytes_data';"

# Redis memory
redis-cli info memory | grep -E "used_memory_human|maxmemory_human"

# Memory per PHP-FPM process
ps -C php-fpm8.4 -o pid,user,%mem,rss,args --sort=-rss

CPU monitoring

htop
uptime                                    # load average (1/5/15 min)
ps aux --sort=-%cpu | head -11           # top CPU processes
nproc                                     # number of CPU cores

Disk I/O

df -h
sudo du -sh /home/webservice/*
sudo du -sh /var/log/*
sudo du -sh /var/lib/mysql/*
sudo iostat -x 1 5                        # live I/O statistics (1-second interval, 5 reports)

Network

sudo ss -tlnp                            # listening ports
ss -s                                     # connection state summary
sudo ss -tn | awk '{print $4}' | grep -oP ':\d+$' | sort | uniq -c | sort -rn | head -10
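
The last pipeline counts TCP connections per local port by taking the fourth column of `ss -tn` and keeping the trailing port number. The same logic is sketched below in Python for cases where the result feeds into a script. `count_local_ports` is a hypothetical helper and the sample output is hand-written, not captured from a server.

```python
import re
from collections import Counter


def count_local_ports(ss_output: str) -> list[tuple[str, int]]:
    """Count connections per local port from `ss -tn` text; return the top 10."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        # Fourth column is the local address:port; the header row has no match.
        match = re.search(r":(\d+)$", fields[3])
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(10)


# Illustrative sample using documentation-only addresses.
sample_ss = """\
State  Recv-Q Send-Q Local Address:Port   Peer Address:Port
ESTAB  0      0      10.0.0.5:443         198.51.100.7:51234
ESTAB  0      0      10.0.0.5:443         198.51.100.8:40022
ESTAB  0      0      10.0.0.5:22          198.51.100.9:60000
ESTAB  0      0      [::1]:3306           [::1]:45678
"""

for port, count in count_local_ports(sample_ss):
    print(f"{count:7d} :{port}")
```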

PHP-FPM pool status

ps aux | grep "php-fpm" | grep -v grep | wc -l          # process count
ps aux | grep "php-fpm" | grep -v grep | awk '{print $NF}' | sort | uniq -c  # count per pool
sudo grep "max_children" /var/log/php8.4-fpm.log | tail -10  # whether max_children was reached

MySQL performance

# Connection status
sudo mysql -e "SHOW STATUS LIKE 'Threads%';"

# Slow query summary
sudo mysqldumpslow -s t -t 10 /var/log/mysql/slow.log

# InnoDB buffer pool hit rate
sudo mysql -e "
  SELECT
    ROUND((1 - (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_reads') /
                (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_read_requests')) * 100, 2) AS buffer_pool_hit_rate_pct;
"

# Table lock waits
sudo mysql -e "SHOW STATUS LIKE 'Table_locks%';"
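
The buffer pool hit rate query compares reads that had to go to disk (`Innodb_buffer_pool_reads`) against all logical read requests (`Innodb_buffer_pool_read_requests`). The arithmetic is shown below with made-up counter values; `buffer_pool_hit_rate` is a hypothetical helper that mirrors the SQL expression, except that it raises an error where the SQL would return NULL.

```python
def buffer_pool_hit_rate(disk_reads: int, read_requests: int) -> float:
    """Percentage of read requests served from the buffer pool, rounded to 2 places."""
    if read_requests == 0:
        raise ValueError("no read requests recorded yet")
    return round((1 - disk_reads / read_requests) * 100, 2)


# Illustrative counters: 1,500 disk reads out of 1,000,000 read requests.
print(buffer_pool_hit_rate(1_500, 1_000_000))  # 99.85
```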

PM2 monitoring

pm2 status
pm2 monit                               # live CPU/memory
pm2 describe sam-front                   # detailed information
pm2 describe sam-front | grep -A5 "restart"   # restart history
