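김보곤 2ed9d07901 docs: write the voice input (STT) technical guide

- Detailed description of the Web Speech API based VoiceInputButton component
- interim/final text rendering rules, preview panel UI spec
- SpeechRecognition configuration options, event handler details
- Checklist for applying to a new page (frontend/backend)
- Backend STT usage tracking (AiTokenHelper) pattern
- Troubleshooting guide (HTTPS, permissions, unmounting, etc.)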

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-10 09:09:41 +09:00

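Voice Input (STT) Technical Guide

Document version: 1.0
Date written: 2026-02-10
First applied to: Construction Site Photo Sheet (/juil/construction-photos)
Target project: MNG (React 18 + in-browser Babel)

1. Overview

1.1 Purpose

A microphone button is placed next to text fields (input, textarea) so that users can enter text by voice, using the browser's built-in STT (Speech-to-Text) capability.

1.2 Technology choice

Approach	Cost	Accuracy	Latency	Decision
Web Speech API (browser built-in)	Free	High (Google STT engine)	Real time	Adopted
Google Cloud STT API	Paid ($0.006 / 15 s)	Very high	Server round trip	Not adopted
Whisper (OpenAI)	Paid ($0.006 / min)	Very high	Server round trip	Not adopted

Rationale: in Chromium-based browsers, the built-in Web Speech API uses the Google STT engine free of charge and streams interim/final results in real time. It provides sufficient Korean recognition accuracy at no cost.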

1.3 Browser support

Browser	Supported	Notes
Chrome (Desktop/Android)	✅	Best support, uses the Google STT engine
Edge	✅	Chromium-based
Safari (iOS/macOS)	✅	Exposed as `webkitSpeechRecognition`
Firefox	❌	Not supported (the button is hidden automatically)
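
The support table can be turned into a small feature-detection helper. The following is a minimal sketch, not the project's code: `getSpeechRecognitionCtor` and the stand-in window objects are hypothetical names used only for illustration. The window object is passed in as a parameter so the function can be exercised outside a browser.

```javascript
// Returns the SpeechRecognition constructor exposed by `win`, or null if there is none.
// `win` is injected (instead of reading the global `window`) so the helper is testable.
function getSpeechRecognitionCtor(win) {
    if (!win) return null;
    return win.SpeechRecognition || win.webkitSpeechRecognition || null;
}

// Stand-in window objects (hypothetical, for illustration only)
function FakeRecognition() {}
const prefixedOnlyWindow = { webkitSpeechRecognition: FakeRecognition };
const unsupportedWindow = {};

console.log(getSpeechRecognitionCtor(prefixedOnlyWindow) === FakeRecognition); // true
console.log(getSpeechRecognitionCtor(unsupportedWindow));                      // null
```

In a component, a `null` result would correspond to rendering nothing, which matches the "button is hidden automatically" behavior listed for Firefox.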

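2. Core concept: interim vs. final

The key to working with the Web Speech API is the distinction between interim (unconfirmed) text and final (confirmed) text.

2.1 Text state flow

[Voice input starts]
    │
    ├─ interim: "안녕하"          ← recognition in progress (may still change)
    ├─ interim: "안녕하세"         ← correction (overwrites the previous interim)
    ├─ interim: "안녕하세요"       ← correction
    │
    ├─ ★ FINAL: "안녕하세요"      ← confirmed (must never be removed)
    │
    ├─ interim: "반갑습"          ← a new utterance begins
    ├─ interim: "반갑습니다"
    │
    ├─ ★ FINAL: "반갑습니다"      ← confirmed
    │
[Voice input ends]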

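2.2 Rendering rules (mandatory)

State	Style	Behavior	Removable
interim (unconfirmed)	`italic` + `text-gray-400`	Corrected in real time; each update overwrites the previous interim text	Correction only
final (confirmed)	`font-normal` + `text-white`	Appended permanently to the `finalizedSegments[]` array	Never (for the duration of the recording session)

2.3 Rules for updating the input

- Call onResult(transcript) only when a final result arrives, and append that text to the input
- Show interim text in the preview panel only; never write it to the input
- Text already added to the input is ordinary text that the user can edit directly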
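
The rules in 2.2 and 2.3 can be expressed as a pure reducer over recognition results. This is a minimal sketch under stated assumptions: `applyResult` and the simplified `{ transcript, isFinal }` result shape are hypothetical and stand in for the real SpeechRecognitionResult objects.

```javascript
// Folds one recognition result into the preview state.
// state:  { finalizedSegments: string[], interimText: string }
// result: { transcript: string, isFinal: boolean }  (simplified, hypothetical shape)
function applyResult(state, result) {
    if (result.isFinal) {
        // final: append permanently and clear the interim text
        return {
            finalizedSegments: [...state.finalizedSegments, result.transcript],
            interimText: '',
        };
    }
    // interim: overwrite the previous interim text, never touch finalized segments
    return { finalizedSegments: state.finalizedSegments, interimText: result.transcript };
}

const stream = [
    { transcript: '안녕하', isFinal: false },
    { transcript: '안녕하세요', isFinal: false },
    { transcript: '안녕하세요', isFinal: true },
    { transcript: '반갑습', isFinal: false },
];
const previewState = stream.reduce(applyResult, { finalizedSegments: [], interimText: '' });
console.log(previewState);
// { finalizedSegments: [ '안녕하세요' ], interimText: '반갑습' }
```

Only the entries in `finalizedSegments` would be forwarded to the input; `interimText` stays in the preview.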

3. Component architecture

3.1 The VoiceInputButton component

┌─────────────────────────────────┐
│  VoiceInputButton               │
│                                 │
│  Props:                         │
│    onResult: (text) => void     │  ← receives final text only
│    disabled: boolean            │  ← disables the button (read-only mode, etc.)
│                                 │
│  State:                         │
│    recording: boolean           │  ← whether recording is in progress
│    finalizedSegments: string[]  │  ← accumulated final text (for the preview)
│    interimText: string          │  ← current interim text
│                                 │
│  Refs:                          │
│    recognitionRef               │  ← SpeechRecognition instance
│    startTimeRef                 │  ← recording start time (usage tracking)
│    dismissTimerRef              │  ← timer that closes the preview
│    previewRef                   │  ← preview DOM node (auto-scroll)
│                                 │
│  Output:                        │
│  [mic button] + [preview panel] │
└─────────────────────────────────┘

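3.2 Full code

// Assumes the React hooks and the page-level helpers `apiFetch` and `API` are already in scope,
// e.g. const { useState, useRef, useCallback, useEffect } = React;
function VoiceInputButton({ onResult, disabled }) {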
    const [recording, setRecording] = useState(false);
    const [finalizedSegments, setFinalizedSegments] = useState([]);
    const [interimText, setInterimText] = useState('');
    const recognitionRef = useRef(null);
    const startTimeRef = useRef(null);
    const dismissTimerRef = useRef(null);
    const previewRef = useRef(null);

    // Check browser support
    const isSupported = typeof window !== 'undefined' &&
        (window.SpeechRecognition || window.webkitSpeechRecognition);

    // Log STT usage (tracked alongside AI token usage)
    const logUsage = useCallback((startTime) => {
        const duration = Math.max(1, Math.round((Date.now() - startTime) / 1000));
        apiFetch(API.logSttUsage, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ duration_seconds: duration }),
        }).catch(() => {});
    }, []);

    // Auto-scroll the preview panel
    useEffect(() => {
        if (previewRef.current) {
            previewRef.current.scrollTop = previewRef.current.scrollHeight;
        }
    }, [finalizedSegments, interimText]);

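    // Stop recording
    const stopRecording = useCallback(() => {
        recognitionRef.current?.stop();
        recognitionRef.current = null;
        if (startTimeRef.current) {
            logUsage(startTimeRef.current);
            startTimeRef.current = null;
        }
        setRecording(false);
        setInterimText('');
        // Close the preview 2 seconds after recording ends.
        // Clear any pending timer first so an orphaned timer cannot wipe a later session's preview.
        if (dismissTimerRef.current) clearTimeout(dismissTimerRef.current);
        dismissTimerRef.current = setTimeout(() => {
            setFinalizedSegments([]);
        }, 2000);
    }, [logUsage]);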

    // Start recording
    const startRecording = useCallback(() => {
        // Clear any previous dismiss timer
        if (dismissTimerRef.current) {
            clearTimeout(dismissTimerRef.current);
            dismissTimerRef.current = null;
        }

        const SR = window.SpeechRecognition || window.webkitSpeechRecognition;
        const recognition = new SR();
        recognition.lang = 'ko-KR';           // Korean
        recognition.continuous = true;          // continuous recognition (no automatic stop on pauses)
        recognition.interimResults = true;      // receive interim results
        recognition.maxAlternatives = 1;        // top candidate only

        recognition.onresult = (event) => {
            // Cancel the dismiss timer (results are still arriving)
            if (dismissTimerRef.current) {
                clearTimeout(dismissTimerRef.current);
                dismissTimerRef.current = null;
            }

            let currentInterim = '';
            for (let i = event.resultIndex; i < event.results.length; i++) {
                const transcript = event.results[i][0].transcript;
                if (event.results[i].isFinal) {
                    // ★ Final: push to the input and store permanently in the preview
                    onResult(transcript);
                    setFinalizedSegments(prev => [...prev, transcript]);
                    currentInterim = '';
                } else {
                    // Interim: may still be corrected; finalized text is left untouched.
                    // One event can carry several interim results, so concatenate them.
                    currentInterim += transcript;
                }
            }
            setInterimText(currentInterim);
        };

        recognition.onerror = () => stopRecording();

        recognition.onend = () => {
            // Ignore a stale session that ends after a newer one has already started
            if (recognitionRef.current && recognitionRef.current !== recognition) return;
            // Handles the case where the browser ended the session on its own
            if (startTimeRef.current) {
                logUsage(startTimeRef.current);
                startTimeRef.current = null;
            }
            setRecording(false);
            setInterimText('');
            recognitionRef.current = null;
            // Replace (not stack) any dismiss timer already scheduled by stopRecording
            if (dismissTimerRef.current) clearTimeout(dismissTimerRef.current);
            dismissTimerRef.current = setTimeout(() => {
                setFinalizedSegments([]);
            }, 2000);
        };

        recognitionRef.current = recognition;
        startTimeRef.current = Date.now();
        setFinalizedSegments([]);
        setInterimText('');
        recognition.start();
        setRecording(true);
    }, [onResult, stopRecording, logUsage]);

    // Toggle (start/stop)
    const toggle = useCallback((e) => {
        e.preventDefault();
        e.stopPropagation();
        if (disabled || !isSupported) return;
        recording ? stopRecording() : startRecording();
    }, [disabled, isSupported, recording, stopRecording, startRecording]);

    // Clean up when the component unmounts
    useEffect(() => {
        return () => {
            recognitionRef.current?.stop();
            if (dismissTimerRef.current) clearTimeout(dismissTimerRef.current);
        };
    }, []);

    // Render nothing in unsupported browsers
    if (!isSupported) return null;

    const hasContent = finalizedSegments.length > 0 || interimText;

    return (
        <div className="relative flex-shrink-0">
            {/* Microphone button */}
            <button
                type="button"
                onClick={toggle}
                disabled={disabled}
                title={recording ? '녹음 중지 (클릭)' : '음성으로 입력'}
                className={`inline-flex items-center justify-center w-8 h-8 rounded-full transition-all
                    ${recording
                        ? 'bg-red-500 text-white shadow-lg shadow-red-200'
                        : 'bg-gray-100 text-gray-500 hover:bg-blue-100 hover:text-blue-600'}
                    ${disabled ? 'opacity-30 cursor-not-allowed' : 'cursor-pointer'}`}
            >
                {recording ? (
                    <span className="relative flex items-center justify-center w-4 h-4">
                        <span className="absolute inset-0 rounded-full bg-white/30 animate-ping" />
                        <svg className="w-3.5 h-3.5 relative" fill="currentColor" viewBox="0 0 24 24">
                            <rect x="6" y="6" width="12" height="12" rx="2" />
                        </svg>
                    </span>
                ) : (
                    <svg className="w-4 h-4" fill="currentColor" viewBox="0 0 24 24">
                        <path d="M12 14c1.66 0 3-1.34 3-3V5c0-1.66-1.34-3-3-3S9 3.34
                            9 5v6c0 1.66 1.34 3 3 3z" />
                        <path d="M17 11c0 2.76-2.24 5-5 5s-5-2.24-5-5H5c0 3.53 2.61
                            6.43 6 6.92V21h2v-3.08c3.39-.49 6-3.39 6-6.92h-2z" />
                    </svg>
                )}
            </button>

            {/* Streaming preview panel */}
            {(recording || hasContent) && (
                <div
                    ref={previewRef}
                    className="absolute bottom-full mb-2 right-0 bg-gray-900 rounded-lg
                        shadow-xl z-50 w-[300px] max-h-[120px] overflow-y-auto px-3 py-2"
                    style={{ lineHeight: '1.6' }}
                >
                    {/* Final text: normal weight + white */}
                    {finalizedSegments.map((seg, i) => (
                        <span key={i} className="text-white text-xs font-normal
                            transition-colors duration-300">
                            {seg}
                        </span>
                    ))}

                    {/* Interim text: italic + light gray */}
                    {interimText && (
                        <span className="text-gray-400 text-xs italic
                            transition-colors duration-200">
                            {interimText}
                        </span>
                    )}

                    {/* Recording with no text yet: waiting indicator */}
                    {recording && !hasContent && (
                        <span className="text-gray-500 text-xs flex items-center gap-1.5">
                            <span className="inline-block w-1.5 h-1.5 bg-red-400
                                rounded-full animate-pulse" />
                            말씀하세요...
                        </span>
                    )}

                    {/* After recording ends: mark the final text as complete */}
                    {!recording && finalizedSegments.length > 0 && !interimText && (
                        <span className="text-green-400 text-xs ml-1">&#10003;</span>
                    )}
                </div>
            )}
        </div>
    );
}

4. Usage patterns

4.1 Basic usage (placed next to an input)

function MyForm() {
    const [value, setValue] = useState('');

    return (
        <div>
            <label className="block text-sm font-medium text-gray-700 mb-1">
                현장명 *
            </label>
            <div className="flex items-center gap-2">
                <input
                    type="text"
                    value={value}
                    onChange={e => setValue(e.target.value)}
                    className="flex-1 px-3 py-2 border border-gray-300 rounded-lg text-sm"
                    placeholder="입력하세요"
                />
                <VoiceInputButton
                    onResult={(text) => setValue(prev =>
                        prev ? prev + ' ' + text : text
                    )}
                />
            </div>
        </div>
    );
}

4.2 Using with a textarea

<div className="flex items-start gap-2">  {/* items-start: align to the top */}
    <textarea
        value={description}
        onChange={e => setDescription(e.target.value)}
        className="flex-1 px-3 py-2 border rounded-lg text-sm"
        rows={3}
    />
    <VoiceInputButton
        onResult={(text) => setDescription(prev =>
            prev ? prev + ' ' + text : text
        )}
    />
</div>

4.3 Conditional activation (edit mode only)

<VoiceInputButton
    onResult={(text) => setSiteName(prev => prev ? prev + ' ' + text : text)}
    disabled={!editing}  // disabled when not in edit mode
/>

4.4 onResult callback patterns

// Pattern 1: append to the existing text (separated by a space)
onResult={(text) => setValue(prev => prev ? prev + ' ' + text : text)}

// Pattern 2: overwrite
onResult={(text) => setValue(text)}

// Pattern 3: custom post-processing
onResult={(text) => {
    const cleaned = text.trim().replace(/\s+/g, ' ');
    // Avoid a leading space when the field is still empty
    setValue(prev => prev ? prev + ' ' + cleaned : cleaned);
}}
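
Patterns 1 and 3 can be combined into one reusable helper so that every field appends transcripts the same way. The sketch below is illustrative only; `appendTranscript` is a hypothetical name, not part of the reference implementation.

```javascript
// Appends a recognized transcript to the existing field value.
// - collapses runs of whitespace inside the transcript
// - ignores transcripts that are empty after trimming
// - inserts a single space only when there is existing text
function appendTranscript(prev, text) {
    const cleaned = text.trim().replace(/\s+/g, ' ');
    if (!cleaned) return prev;
    return prev ? prev + ' ' + cleaned : cleaned;
}

console.log(appendTranscript('', '  현장   A ')); // "현장 A"
console.log(appendTranscript('1공구', '현장 A')); // "1공구 현장 A"
console.log(appendTranscript('1공구', '   '));    // "1공구"
```

With this helper, a callback would look like `onResult={(text) => setValue(prev => appendTranscript(prev, text))}`.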

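5. Preview panel UI details

5.1 Position and style

                    ┌─────────────────────────────┐
                    │ final text  interim text... │  ← preview panel
                    │ (white)     (gray, italic)  │     bg-gray-900
                    └─────────────────────────────┘     w-[300px]
                                               ┌──┐    max-h-[120px]
                                               │🎤│    line-height: 1.6
                                               └──┘

- Position: above the button (absolute bottom-full mb-2 right-0)
- Background: dark (bg-gray-900), so it stands out against a light form
- Size: fixed width of 300px, maximum height of 120px (scrolls beyond that)
- Auto-scroll: scrolls to the bottom automatically as the text grows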

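5.2 Display by state

State	What is shown
Right after recording starts (no text)	🔴 `말씀하세요...` ("Please speak...": red dot + gray text)
Receiving interim results	Final text (white) + interim text (gray italic)
Moment a result becomes final	Previous final text + new final text (white) appended; interim cleared
Right after recording ends	All final text + ✓ mark (green)
2 seconds after recording ends	Panel closes automatically (`finalizedSegments` is reset)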
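
The table maps directly onto the component's three state values. As a rough sketch, the display state can be derived with a pure function; `getPreviewState` and the state names ('hidden', 'waiting', 'streaming', 'done') are hypothetical labels chosen for this example.

```javascript
// Derives which row of the table applies from the component state.
function getPreviewState({ recording, finalizedSegments, interimText }) {
    const hasContent = finalizedSegments.length > 0 || Boolean(interimText);
    if (!recording && !hasContent) return 'hidden';   // panel closed
    if (recording && !hasContent) return 'waiting';   // red dot + prompt
    if (!recording && finalizedSegments.length > 0 && !interimText) return 'done'; // ✓ mark
    return 'streaming';                               // final (white) + interim (gray italic)
}

console.log(getPreviewState({ recording: true, finalizedSegments: [], interimText: '' }));        // "waiting"
console.log(getPreviewState({ recording: true, finalizedSegments: ['안녕하세요'], interimText: '반갑' })); // "streaming"
console.log(getPreviewState({ recording: false, finalizedSegments: ['안녕하세요'], interimText: '' }));   // "done"
console.log(getPreviewState({ recording: false, finalizedSegments: [], interimText: '' }));       // "hidden"
```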

5.3 Transition settings

Final text:    transition-colors duration-300  (0.3 s color transition)
Interim text:  transition-colors duration-200  (0.2 s color transition)
line-height:   fixed at 1.6 (prevents the line height from shifting)

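6. SpeechRecognition configuration details

6.1 Main options

// Some browsers expose only the prefixed constructor, so resolve it first
const SR = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SR();
recognition.lang = 'ko-KR';           // language (Korean)
recognition.continuous = true;          // continuous recognition mode
recognition.interimResults = true;      // receive interim results
recognition.maxAlternatives = 1;        // number of recognition candidates

Option	Value	Description
`lang`	`'ko-KR'`	Korean recognition. Change this if other languages are needed
`continuous`	`true`	Does not stop automatically when the speaker pauses. The user stops it manually
`interimResults`	`true`	Receive interim results in real time (with false, only final results arrive)
`maxAlternatives`	`1`	Only the top recognition candidate is returned (also the spec default)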

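6.2 Event handlers

Event	Fires when	Handling
`onresult`	A recognition result arrives	Separate interim from final, then update state
`onerror`	A recognition error occurs	Stop recording
`onend`	The recognition session ends	Cleanup + usage logging + preview dismiss timer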

6.3 onresult event details

recognition.onresult = (event) => {
    // event.resultIndex: index of the first result that changed in this event
    // event.results: SpeechRecognitionResultList (cumulative)
    // event.results[i].isFinal: whether the result is final
    // event.results[i][0].transcript: the recognized text

    for (let i = event.resultIndex; i < event.results.length; i++) {
        const transcript = event.results[i][0].transcript;
        if (event.results[i].isFinal) {
            // → push to the input + append to finalizedSegments
        } else {
            // → update interimText (overwrites the previous interim)
        }
    }
};

Note: always iterate starting from event.resultIndex. Because event.results is cumulative, iterating from 0 would process final results that were already handled in an earlier event a second time.
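
The effect of starting at `resultIndex` is easiest to see with a hand-built event. The sketch below uses a stand-in object rather than a real SpeechRecognitionEvent; `collectResults` and `fakeResult` are hypothetical helpers, and this version concatenates multiple interim results within a single event.

```javascript
// Splits one onresult event into newly finalized transcripts and the current interim text.
function collectResults(event) {
    const finals = [];
    let interim = '';
    for (let i = event.resultIndex; i < event.results.length; i++) {
        const result = event.results[i];
        const transcript = result[0].transcript;
        if (result.isFinal) {
            finals.push(transcript);
            interim = '';
        } else {
            interim += transcript;
        }
    }
    return { finals, interim };
}

// Stand-in for a SpeechRecognitionResult: array-like with an isFinal flag
function fakeResult(transcript, isFinal) {
    const result = [{ transcript }];
    result.isFinal = isFinal;
    return result;
}

// results[0] was already delivered as final by an earlier event
const sampleEvent = {
    resultIndex: 1,
    results: [fakeResult('안녕하세요', true), fakeResult('반갑습니다', true), fakeResult('오늘', false)],
};

console.log(collectResults(sampleEvent));
// { finals: [ '반갑습니다' ], interim: '오늘' }
```

Starting the loop at 0 instead would return '안녕하세요' again, and the input would receive that text twice.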

7. Backend (STT usage tracking)

7.1 Route

// routes/web.php (inside the juil group)
Route::post('/construction-photos/log-stt-usage',
    [ConstructionSitePhotoController::class, 'logSttUsage']
)->name('construction-photos.log-stt-usage');

7.2 Controller

public function logSttUsage(Request $request): JsonResponse
{
    $validated = $request->validate([
        'duration_seconds' => 'required|integer|min:1',
    ]);

    AiTokenHelper::saveSttUsage(
        '공사현장사진대지-음성입력',  // menu name (identifies where the usage came from)
        $validated['duration_seconds']
    );

    return response()->json(['success' => true]);
}

7.3 AiTokenHelper::saveSttUsage

// App\Helpers\AiTokenHelper

/**
 * Record STT usage
 * - Billing basis: $0.009 / 15 seconds
 * - Unit price based on Google Cloud Speech-to-Text
 *
 * @param string $menuName  Menu name of the calling feature
 * @param int    $durationSeconds  Recording duration (seconds)
 */
public static function saveSttUsage(string $menuName, int $durationSeconds): void
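
The body of `saveSttUsage` is not shown here, so the arithmetic below is only a sketch of how a duration could be converted into an estimated cost from the stated rate. It is written in JavaScript for illustration; `estimateSttCostUsd` is a hypothetical name, and rounding up to whole 15-second blocks is an assumption that the source does not confirm.

```javascript
// Hypothetical cost estimate based on the rate in the docblock ($0.009 per 15 seconds).
// Assumption: usage is billed in whole 15-second blocks, rounded up.
const BLOCK_SECONDS = 15;
const USD_PER_BLOCK = 0.009;

function estimateSttCostUsd(durationSeconds) {
    const blocks = Math.ceil(durationSeconds / BLOCK_SECONDS);
    return blocks * USD_PER_BLOCK;
}

console.log(estimateSttCostUsd(15).toFixed(3)); // "0.009"
console.log(estimateSttCostUsd(16).toFixed(3)); // "0.018"
console.log(estimateSttCostUsd(60).toFixed(3)); // "0.036"
```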

7.4 Pattern for adding a route when applying STT to a new page

// 1. Add a logSttUsage method to the controller
public function logSttUsage(Request $request): JsonResponse
{
    $validated = $request->validate([
        'duration_seconds' => 'required|integer|min:1',
    ]);

    AiTokenHelper::saveSttUsage(
        '새메뉴명-음성입력',      // ← change the menu name (placeholder: "new menu name - voice input")
        $validated['duration_seconds']
    );

    return response()->json(['success' => true]);
}

// 2. Register the route
Route::post('/new-page/log-stt-usage', [NewController::class, 'logSttUsage'])
    ->name('new-page.log-stt-usage');

// 3. Add the endpoint to the frontend API object
const API = {
    logSttUsage: '/path/to/log-stt-usage',
};

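8. Checklist for adding voice input to a new page

8.1 Frontend

□ 1. Copy the VoiceInputButton component code (or extract it into a shared module and import it)
□ 2. Add the logSttUsage endpoint to the API object
□ 3. Place VoiceInputButton next to the input/textarea
□ 4. Use the append pattern in the onResult callback so new text is added to the existing text
□ 5. Use the disabled prop to enable the button only in edit mode (if needed)
□ 6. Check the flex layout:
     - input: items-center gap-2 (single line)
     - textarea: items-start gap-2 (top-aligned)

8.2 Backend

□ 1. Add a logSttUsage method to the controller
□ 2. Call AiTokenHelper::saveSttUsage() (with the menu name)
□ 3. Register the POST route in routes/web.php

8.3 Layout reference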

┌───────────────────────────────────────────┐
│ label                                     │
│ ┌──────────────────────────────────┐ ┌──┐ │
│ │ input text                       │ │🎤│ │
│ └──────────────────────────────────┘ └──┘ │
│                                           │
│ label                                     │
│ ┌──────────────────────────────────┐ ┌──┐ │
│ │ textarea                         │ │🎤│ │
│ │                                  │ │  │ │
│ │                                  │ │  │ │
│ └──────────────────────────────────┘ └──┘ │
└───────────────────────────────────────────┘

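9. Caveats and troubleshooting

9.1 HTTPS is required

The Web Speech API works only in a secure context, which means HTTPS (localhost is an exception). When the page is served over plain HTTP, microphone access is blocked.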
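
A page can check this condition up front through the standard `window.isSecureContext` property. The following is a minimal sketch; `canUseVoiceInput` is a hypothetical helper, and the window object is passed in so the function can be tested with stand-ins.

```javascript
// True only when the page is a secure context AND a SpeechRecognition constructor exists.
function canUseVoiceInput(win) {
    if (!win || win.isSecureContext !== true) return false;
    return Boolean(win.SpeechRecognition || win.webkitSpeechRecognition);
}

// Stand-in window objects (hypothetical, for illustration only)
function StubRecognition() {}
const httpsPage = { isSecureContext: true, webkitSpeechRecognition: StubRecognition };
const httpPage = { isSecureContext: false, webkitSpeechRecognition: StubRecognition };

console.log(canUseVoiceInput(httpsPage)); // true
console.log(canUseVoiceInput(httpPage));  // false
```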

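9.2 Automatic termination by the browser

Even with continuous: true, the browser may end recognition on its own after a long period of silence. The onend handler covers this case: it cleans up, logs usage, and returns the button to the idle state, so the user has to click again to resume.

9.3 Microphone permission

On first use, the browser asks for permission to access the microphone. If the user denies it, onerror fires and the button returns to the stopped state.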
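
The reference component stops recording on any error without telling the user why. If feedback is wanted, the `error` code on the error event can be mapped to a message. The sketch below is an optional addition, not part of the reference implementation: `describeSpeechError` and the message texts are hypothetical, while the error codes themselves come from the Web Speech API specification.

```javascript
// Maps SpeechRecognitionErrorEvent.error codes to user-facing messages (texts are examples).
const SPEECH_ERROR_MESSAGES = new Map([
    ['not-allowed', 'Microphone permission was denied. Allow microphone access in the browser settings.'],
    ['service-not-allowed', 'Speech recognition is not allowed for this page.'],
    ['audio-capture', 'No microphone could be used for audio capture.'],
    ['no-speech', 'No speech was detected.'],
    ['network', 'A network error interrupted speech recognition.'],
]);

function describeSpeechError(code) {
    return SPEECH_ERROR_MESSAGES.get(code) || 'Voice input stopped unexpectedly.';
}

console.log(describeSpeechError('not-allowed'));
console.log(describeSpeechError('something-else')); // falls back to the generic message
```

Inside the component, the handler could then read `event.error`, pass it to `describeSpeechError`, show the message in whatever notification UI the page uses, and call `stopRecording()` as before.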

9.4 Cleanup on component unmount

When the button is used inside a modal, the component unmounts when the modal closes. The useEffect cleanup must call recognition.stop() and clearTimeout.

useEffect(() => {
    return () => {
        recognitionRef.current?.stop();
        if (dismissTimerRef.current) clearTimeout(dismissTimerRef.current);
    };
}, []);

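9.5 Preventing event propagation

A button inside a form acts as a submit button unless it declares type="button", so a click on the microphone button can submit the form. The component sets type="button", and the click handler must also call e.preventDefault() + e.stopPropagation().

const toggle = useCallback((e) => {
    e.preventDefault();
    e.stopPropagation();
    // ...
}, [disabled, isSupported, recording, stopRecording, startRecording]);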

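9.6 Multiple VoiceInputButton instances

Several VoiceInputButton instances can be placed on one page. Each instance has its own recognitionRef, so they do not interfere with one another. However, two or more recordings cannot run at the same time (a browser microphone limitation). If another button is pressed while one is recording, the existing recording is interrupted (this is browser behavior).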
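
The reference implementation relies on the browser to interrupt the earlier recording. If explicit control is preferred, a small module-level coordinator can stop the previous recorder before a new one starts. This is an optional sketch, not the project's code; `claimMicrophone` and `releaseMicrophone` are hypothetical names, and the two stop functions stand in for each button's `stopRecording`.

```javascript
// Remembers the stop function of the button that is currently recording.
let activeStop = null;

// Call before starting a recording: stops whichever other button is active.
function claimMicrophone(stopFn) {
    if (activeStop && activeStop !== stopFn) activeStop();
    activeStop = stopFn;
}

// Call when a recording ends: frees the slot if this button still owns it.
function releaseMicrophone(stopFn) {
    if (activeStop === stopFn) activeStop = null;
}

// Two stand-in buttons
const stopLog = [];
const stopButtonA = () => { stopLog.push('A stopped'); releaseMicrophone(stopButtonA); };
const stopButtonB = () => { stopLog.push('B stopped'); releaseMicrophone(stopButtonB); };

claimMicrophone(stopButtonA); // A starts recording
claimMicrophone(stopButtonB); // B starts, so A is stopped first
console.log(stopLog);         // [ 'A stopped' ]
```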

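10. Possible future extensions

Feature	Description	Difficulty / requirement
Speaker diarization	Distinguish multiple speakers and transcribe each one separately	Requires the Google Cloud STT API
Language switching	Change `recognition.lang` dynamically	Low
Voice commands	Perform an action when a specific keyword is recognized (e.g. "저장" (save), "다음" (next))	Medium
Saving recordings	Store audio files in GCS using the MediaRecorder API	Medium
Real-time translation	Pass STT results to a translation API	Medium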

Appendix A: Reference implementation files

File	Description
`mng/resources/views/juil/construction-photos.blade.php`	First application (full VoiceInputButton code)
`mng/app/Http/Controllers/Juil/ConstructionSitePhotoController.php`	logSttUsage endpoint
`mng/app/Helpers/AiTokenHelper.php`	saveSttUsage helper
`mng/routes/web.php`	Where the STT route is registered

Appendix B: CSS class summary

Element	Tailwind classes
Microphone button (idle)	`bg-gray-100 text-gray-500 hover:bg-blue-100 hover:text-blue-600 w-8 h-8 rounded-full`
Microphone button (recording)	`bg-red-500 text-white shadow-lg shadow-red-200`
Preview panel	`bg-gray-900 rounded-lg shadow-xl w-[300px] max-h-[120px] overflow-y-auto`
Final text	`text-white text-xs font-normal transition-colors duration-300`
Interim text	`text-gray-400 text-xs italic transition-colors duration-200`
Waiting indicator	`text-gray-500 text-xs` + red dot `animate-pulse`
Done mark	`text-green-400 text-xs` ✓
Disabled	`opacity-30 cursor-not-allowed`
