어디살지 POC Search: Optimal Design for 2026

Hexagonal architecture · Hybrid search (Dense + Sparse + Filter + Rerank) · Overfitting prevention · Incremental validation · 2026-05-19

TL;DR

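Four embedding lanes: Dense (natural-language semantics), Sparse (exact matching), Payload Filter (numbers and enums), Excluded (noise)
2026 pattern: Query Understanding → Hybrid (Dense + BM25) Native Fusion → Cross-encoder Rerank → suggested-utterance generation
Avoiding hardcoding: zero regexes, zero magic values. All thresholds live in config.yaml; persona, area, and synonym data live in PG tables; slot extraction uses LLM structured output
Overfitting prevention: train (scenarios_60) and holdout (scenarios_5) are permanently separated. Each step of N=5→20→50→100 has a gate, and the listing count grows to the next step only after the current gate passes
Five hexagonal Ports: Embedding · Sparse · Vector · Reranker · LLM. Swapping adapters (Gemini → bge-m3, Cohere → bge-reranker-v2-m3) leaves the use cases unchanged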

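1. Embedding lane separation matrix: what gets embedded and what does not

Putting every field into every index dilutes the signal and only adds cost and latency. The 2026 standard is to split fields into three indexed lanes (semantic, exact match, filter) and exclude everything else.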

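Field	Type	Current location	Lane	Reason
description (LLM-enriched)	natural language	properties.description	DENSE	Semantic representation of complex name, size, floor, options, and neighborhood characteristics
address	place-name text	properties.address	DENSE + SPARSE	Semantics (surrounding area) + exact matching (exact dong/gu tokens)
domain_keywords (5-10 nouns from the LLM)	token set	raw_payload	SPARSE	BM25 IDF discriminates terms such as "역세권" (near a subway station) and "학원가" (cram-school district)
complex_name (apartment complex name)	proper noun	raw_payload	SPARSE	Exact token matching matters for names such as "래미안" and "자이"
property_type	enum	property_type_enum	FILTER	The Korean token ("투룸", two-room) is already in description, so embedding the English enum value only adds noise
room_count / bath_count	int	properties	FILTER	Text such as "2룸" carries weak semantics; a range filter is exact
deposit / rent / maintenance_fee	BIGINT	properties	FILTER	9-digit numbers lose their meaning when embedded, and their BM25 IDF is 0
floor_category	enum	floor_category_enum	FILTER	The "고층" (high floor) token already appears in description as natural language
floor_number	int	properties	FILTER	Range filter
pet_allowed / parking / elevator	bool	raw_payload	FILTER	Must match the listing-card conditions; exact filter
coord (lat/lng)	geography	PostGIS POINT	FILTER	ST_DWithin / Qdrant geo radius; never embed
available_in_days	int	properties	FILTER	Combined with the urgency slot for "즉시입주" (immediate move-in)
options (appliance and furniture list)	text[]	raw_payload.options	SPARSE + FILTER	BM25 on "풀옵션" (fully furnished) + exact matching against required_amenities
direction	enum	raw_payload	FILTER	"남향" (south-facing) filter
build_year	int	raw_payload	FILTER	"신축" (newly built) goes into the enriched description; exact values use a range filter
contract_type	enum (전세/월세)	raw_payload	FILTER	"월세" (monthly rent) / "전세" (jeonse lease): search intent is strong, so a filter is unambiguous
area_features	text[]	raw_payload	DENSE	Natural-language phrases such as "한강뷰" (Han River view) and "학원가"; the LLM absorbs them into description
persona_tags (predicted)	set	persona_rules	FILTER	Extracted from the utterance, then matched as a subset of required_payload
id / external_id / created_at	identifiers, timestamps	properties	Excluded	Irrelevant to semantics and search; used only for debugging and pagination
raw_payload (raw JSONB)	blob	properties	Excluded	Redistributed to the lanes above after structuring; the raw blob itself is not indexed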

The current src/domain/listing_text.py:build_indexable_text already follows this principle (it embeds only address + description + domain_keywords and excludes price, room count, and English enums). This PR only needs to add a separate sparse (BM25/SPLADE) index.
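The routing in the matrix can be sketched as a small pure function. This is an illustration only, not the real `build_indexable_text`: `Lanes` and `split_lanes` are hypothetical names, sparse terms are left untokenized (the real flow would run a morphological analyzer), and geo and persona fields are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Any

# Lane membership per field, copied from the matrix above (simplified).
DENSE_FIELDS = ("address", "description")
SPARSE_FIELDS = ("address", "domain_keywords", "complex_name", "options")
FILTER_FIELDS = (
    "property_type", "room_count", "bath_count", "deposit", "rent",
    "maintenance_fee", "floor_category", "floor_number", "pet_allowed",
    "parking", "elevator", "available_in_days", "options", "direction",
    "build_year", "contract_type",
)


@dataclass
class Lanes:
    """Hypothetical container for the three indexed lanes."""
    dense_text: str = ""
    sparse_terms: list[str] = field(default_factory=list)
    filter_payload: dict[str, Any] = field(default_factory=dict)


def split_lanes(listing: dict[str, Any]) -> Lanes:
    """Route each field to its lane; fields in no lane are excluded."""
    dense_parts = [str(listing[name]) for name in DENSE_FIELDS if listing.get(name)]

    sparse_terms: list[str] = []
    for name in SPARSE_FIELDS:
        value = listing.get(name)
        if not value:
            continue
        # List-valued fields contribute one term per item.
        sparse_terms.extend(value if isinstance(value, list) else [str(value)])

    payload = {name: listing[name] for name in FILTER_FIELDS if name in listing}
    return Lanes(" ".join(dense_parts), sparse_terms, payload)
```

Fields such as `id` or `raw_payload` never reach any lane because they appear in none of the three tuples, which mirrors the "Excluded" rows.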

2. Search pipeline: 2026 optimal pattern

User utterance
  │
  ▼
[L1] LLM Query Understanding (Groq function-calling)
     └─► structured slots: intent · clarity · property_types · area_keywords ·
         price_range · room_count · required_amenities ·
         persona_tags · urgency · sentiment
  │
  ▼
[L2] Filter construction (slots → Qdrant Filter + PG WHERE)
     - Every exact match becomes a filter (mandatory)
     - Synonym expansion is a lookup in the PG `synonyms` table (no LLM call)
  │
  ▼
[L3] Hybrid Retrieval (Qdrant native fusion)
     ┌─ Dense  : Gemini Embedding 2 (3072d)          ─┐  RRF / linear
     ├─ Sparse : BM25 + Kiwi morphological analysis  ─┤  weights α / β
     └─ Filter : payload + geo radius                ─┘  (config.yaml)
  │  top-50 candidates
  ▼
[L4] Cross-encoder Rerank
     Cohere Rerank-Multilingual-v3 or BGE-Reranker-v2-M3
     top-50 → top-10 (semantic reordering)
  │
  ▼
[L5] Persona Boost (rule-based weights from PG `persona_rules`)
     No override; a simple, deterministic score adjustment
  │
  ▼
[L6] Result + Suggestion Generation (LLM)
     Listing cards + 4-axis suggested utterances (next-action proposals)
  │
  ▼
[L7] Streaming Response (SSE)
     stage → listing → suggestion → delta → signature → done
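For reference, the RRF option in the hybrid retrieval step can be written out in a few lines. In the pipeline this fusion is delegated to Qdrant, so the sketch below only shows what reciprocal rank fusion computes; `rrf_fuse` is a hypothetical helper, the optional per-source weights are an added assumption, and `k=60` is the conventional RRF constant that would live in config.yaml rather than in code.

```python
from collections import defaultdict
from typing import Optional


def rrf_fuse(
    rankings: dict[str, list[str]],
    weights: Optional[dict[str, float]] = None,
    k: int = 60,
    top_n: int = 50,
) -> list[tuple[str, float]]:
    """Fuse ranked id lists: score(doc) = sum over sources of w / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for source, ranked_ids in rankings.items():
        weight = 1.0 if weights is None else weights.get(source, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest score first; ties are broken by id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]
```

A document that appears in both the dense and the sparse list collects two contributions, which is why it tends to outrank a document that tops only one of them.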

3. Five hexagonal Ports: adapter swappability

Port	Interface	Current adapter	Replacement candidates (swap verification)
`EmbeddingPort`	`embed(texts) → vectors[3072]`	OpenRouter Gemini Embedding 2	bge-m3 (1024) · text-embedding-3-large (3072) · KoSimCSE
`SparsePort` (new)	`encode(text) → sparse_vec`	BM25+Kiwi	SPLADE-multilingual · BGE-M3 sparse mode
`VectorStorePort`	`search(filter, dense, sparse, k)`	Qdrant 1.16 (collection `poc_listings_gemini_3072`)	Weaviate · Milvus · PG `pgvector`+RUM
`RerankerPort`	`rerank(query, docs) → reordered`	Cohere Rerank-4-Pro (via OpenRouter)	BAAI/bge-reranker-v2-m3 (local, free) · ColBERTv2 (late interaction)
`LLMPort`	`extract_slots / classify_listing / suggest_4`	Groq `gpt-oss-120b`	Gemini 3 Pro · Claude Haiku 4.5 (strong at multilingual input and function calling)

The use cases (chat_service.py, search_service.py, listing_registrar.py) depend only on Ports, so any combination of adapters requires changing only the wiring (DI).
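A minimal sketch of the port-only dependency, using `typing.Protocol`. The method names follow the interface column above; the `SearchService` body, the `payload_filter` parameter name, and the stub adapters are hypothetical and far simpler than the real use cases.

```python
from typing import Any, Optional, Protocol, Sequence


class EmbeddingPort(Protocol):
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class VectorStorePort(Protocol):
    def search(self, payload_filter: dict[str, Any], dense: list[float],
               sparse: Optional[dict[int, float]], k: int) -> list[str]: ...


class RerankerPort(Protocol):
    def rerank(self, query: str, docs: Sequence[str]) -> list[str]: ...


class SearchService:
    """Use case: knows only the ports, never a concrete adapter."""

    def __init__(self, embedder: EmbeddingPort, store: VectorStorePort,
                 reranker: RerankerPort) -> None:
        self._embedder = embedder
        self._store = store
        self._reranker = reranker

    def search(self, query: str, payload_filter: dict[str, Any], k: int = 10) -> list[str]:
        dense = self._embedder.embed([query])[0]
        candidates = self._store.search(payload_filter, dense, None, 50)
        return self._reranker.rerank(query, candidates)[:k]


# Stub adapters standing in for Gemini / Qdrant / Cohere (illustration only).
class StubEmbedder:
    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(len(text))] for text in texts]


class StubStore:
    def __init__(self, docs: list[str]) -> None:
        self._docs = docs

    def search(self, payload_filter, dense, sparse, k) -> list[str]:
        return self._docs[:k]


class AscendingReranker:
    def rerank(self, query: str, docs: Sequence[str]) -> list[str]:
        return sorted(docs)


class DescendingReranker:
    def rerank(self, query: str, docs: Sequence[str]) -> list[str]:
        return sorted(docs, reverse=True)
```

Swapping `AscendingReranker` for `DescendingReranker` changes only the constructor call, which is the wiring-only change the table is meant to guarantee.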

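4. Overfitting prevention: evaluation split + cross-validation

Principle 1. Train and holdout are permanently separated. Of the 65 scenarios, 60 are train (used for tuning) and 5 are holdout (evaluated only right before submission). Code is not changed after looking at holdout results.

Principle 2. Balance across five axes (intent clarity, life stage, lifestyle, priorities, occupation) to prevent overfitting to a single category. The pass criterion at every stage is the five-axis average.

Principle 3. Cross synthetic and real listings. Stage 3 adds some of the real Daejeon addresses from the mail report to verify generalization (whether weights tuned only on synthetic data break on real addresses).

Principle 4. Track the performance curve across seed sizes. Record NDCG@10 and MRR@10 at N=5 / 20 / 50 / 100. If they do not increase monotonically, suspect a bug in the retrieval weights or the embedding text.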
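The two metrics and the monotonicity check in Principle 4 can be sketched as follows. The helper names are hypothetical, the ideal DCG is computed from the retrieved list only (a common simplification), and the `tolerance` parameter is an added assumption for absorbing measurement noise.

```python
import math
from statistics import mean
from typing import Callable, Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranks (rank 1 has discount 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the same relevances in ideal order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """1 / rank of the first relevant result within the top k, else 0."""
    for rank, rel in enumerate(relevances[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_over_queries(per_query: Sequence[Sequence[float]],
                      metric: Callable[[Sequence[float]], float]) -> float:
    """Average a per-query metric; with reciprocal_rank_at_k this is MRR@10."""
    return mean(metric(relevances) for relevances in per_query)


def is_non_decreasing(values: Sequence[float], tolerance: float = 0.0) -> bool:
    """True if each value is at least the previous one, minus the tolerance."""
    return all(b >= a - tolerance for a, b in zip(values, values[1:]))
```

Feeding the four per-stage NDCG@10 values to `is_non_decreasing` gives the "curve went down, look for a bug" signal described above.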

Principle 5. Look at both the mean of the LLM Judge 5-point score and its variability (std). A high mean with a large std signals overfitting to some categories.
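A small sketch of reading the mean and the spread together. The thresholds 4.0 and 0.6 are the Stage 4 pass criteria from this document; the function names, the per-category breakdown, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def judge_summary(scores_by_category: dict[str, list[float]]) -> dict:
    """Aggregate 5-point judge scores overall and per category."""
    all_scores = [s for scores in scores_by_category.values() for s in scores]
    return {
        "mean": mean(all_scores),
        "std": pstdev(all_scores),  # population std; sample std is equally defensible
        "category_means": {c: mean(s) for c, s in scores_by_category.items()},
    }


def passes_judge_gate(summary: dict, min_mean: float = 4.0, max_std: float = 0.6) -> bool:
    """Both conditions must hold: high mean AND low spread."""
    return summary["mean"] >= min_mean and summary["std"] <= max_std
```

Two categories scoring all 5s and all 3s average exactly 4.0 yet have a std of 1.0, so the gate rejects them; `category_means` then shows which category is dragging.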

5. Avoiding hardcoding: what to externalize and where

Target	Old location (to avoid)	2026 location (externalized)
Slot extraction rules	regexes / keyword matching	LLM function calling (`extract_filters` structured output)
persona → required filters	code dict `PERSONA_RULES`	PG `persona_rules` table (already applied)
Area coordinates	code dict `LOCATIONS`	PG `areas` table + Kakao geocoding fallback (already applied)
Korean labels for property_type	code dict `{"oneroom":"원룸",...}`	PG `enum_labels` table or i18n resources
Synonym expansion	code array	PG `synonyms` table (output of LLM normalization)
Weights (α dense / β sparse / persona boost)	magic values 0.75 / 0.25	`config.yaml` and environment variables; overridable per adapter
Thresholds (recall threshold, rerank top-K)	code constants	`config.yaml`
enum definitions	separate enums in each adapter and test	`src/domain/enums.py` as the single source of truth (already applied)
Listing text enrichment	template strings	LLM `classify_listing` structured output (already applied)
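One way to keep weights and thresholds out of the code is a typed config object that has no defaults at all. The sketch below assumes the YAML file has already been parsed into a dict elsewhere (for example with PyYAML's `yaml.safe_load`); the field names, the `POC_SEARCH_` environment prefix, and `load_search_config` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class SearchConfig:
    dense_weight: float
    sparse_weight: float
    persona_boost: float
    candidate_top_k: int
    rerank_top_k: int


# Explicit casts so environment strings become the right types.
_FIELD_CASTS = {
    "dense_weight": float,
    "sparse_weight": float,
    "persona_boost": float,
    "candidate_top_k": int,
    "rerank_top_k": int,
}


def load_search_config(
    raw: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
    prefix: str = "POC_SEARCH_",
) -> SearchConfig:
    """Build the config from parsed YAML, letting environment variables win."""
    env = os.environ if env is None else env
    values = {}
    for name, cast in _FIELD_CASTS.items():
        env_key = prefix + name.upper()
        if env_key in env:
            values[name] = cast(env[env_key])
        elif name in raw:
            values[name] = cast(raw[name])
        else:
            # No in-code fallback: a missing value is an error, not a magic default.
            raise KeyError(f"missing config value: {name}")
    return SearchConfig(**values)
```

Raising on a missing key is a deliberate choice here: a silent default would reintroduce exactly the magic value the table is trying to remove.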

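6. Incremental validation stages: start small, gate each step

STAGE 1 · N=5

Smoke

POC compose up
5 listings (5 regions)
1 utterance ("강남 투룸 6000/80": Gangnam, two-room, 6000/80)
/v2/ai/chat returns 200 + at least 1 listing card

Pass: 200 response + every SSE stage received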
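The Stage 1 pass check could be automated along these lines. The sketch assumes the stages listed in the pipeline (stage, listing, suggestion, delta, signature, done) arrive as SSE `event:` names, which this document does not confirm; both function names are hypothetical.

```python
REQUIRED_EVENTS = ("stage", "listing", "suggestion", "delta", "signature", "done")


def parse_sse_event_names(stream_text: str) -> list[str]:
    """Collect the value of every `event:` field, in arrival order."""
    names: list[str] = []
    # SSE events are separated by a blank line; normalize CRLF first.
    for block in stream_text.replace("\r\n", "\n").split("\n\n"):
        for line in block.split("\n"):
            if line.startswith("event:"):
                names.append(line[len("event:"):].strip())
    return names


def smoke_passes(status_code: int, stream_text: str) -> bool:
    """Stage 1 gate: HTTP 200 and every required event seen at least once."""
    names = parse_sse_event_names(stream_text)
    return status_code == 200 and all(event in names for event in REQUIRED_EVENTS)
```

Because `listing` is one of the required events, a passing stream also satisfies the "at least 1 listing card" condition.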

STAGE 2 · N=20

Intent extraction accuracy

20 listings (spread evenly across 16 regions)
11 intent + life_stage cases from scenarios_65
slot subset match

Pass: extract precision ≥ 80% · 11/11 responses succeed
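A sketch of the slot subset match and the score behind the 80% threshold. The document does not define "extract precision", so the sketch assumes it means the share of expected slot values that the extractor reproduced; all names here are hypothetical.

```python
from typing import Any, Mapping, Sequence

_COLLECTIONS = (list, set, tuple)


def slot_matches(expected: Any, actual: Any) -> bool:
    """Scalar slots must be equal; collection slots must be contained in the output."""
    if isinstance(expected, _COLLECTIONS):
        return isinstance(actual, _COLLECTIONS) and set(expected) <= set(actual)
    return expected == actual


def subset_match(expected: Mapping[str, Any], extracted: Mapping[str, Any]) -> bool:
    """True if every expected slot is present; extra extracted slots are allowed."""
    return all(slot_matches(value, extracted.get(key)) for key, value in expected.items())


def slot_match_rate(cases: Sequence[tuple[Mapping[str, Any], Mapping[str, Any]]]) -> float:
    """Matched expected slots divided by all expected slots, across cases."""
    total = 0
    hits = 0
    for expected, extracted in cases:
        for key, value in expected.items():
            total += 1
            hits += slot_matches(value, extracted.get(key))
    return hits / total if total else 0.0
```

Under this reading the gate is `slot_match_rate(cases) >= 0.80`, with `subset_match` available for per-scenario pass or fail reporting.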

STAGE 3 · N=50

Five-axis balance

50 listings (40 synthetic + 10 real)
30 five-axis train cases from scenarios_65
Measure NDCG@10 · MRR@10

Pass: NDCG@10 ≥ 0.55 · every axis average ≥ 0.50
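The Stage 3 gate combines an overall bar with a per-axis floor. In the sketch below the overall score is taken as the mean of the axis means, following Principle 2; that reading, the axis key names in the usage example, and `stage3_gate` itself are assumptions.

```python
from statistics import mean


def stage3_gate(
    ndcg_by_axis: dict[str, list[float]],
    min_overall: float = 0.55,
    min_axis: float = 0.50,
) -> tuple[bool, float, list[str]]:
    """Return (passed, overall score, axes below the per-axis floor)."""
    axis_means = {axis: mean(scores) for axis, scores in ndcg_by_axis.items()}
    # Macro average: every axis counts equally, regardless of its case count.
    overall = mean(axis_means.values())
    failing = sorted(axis for axis, value in axis_means.items() if value < min_axis)
    return overall >= min_overall and not failing, overall, failing
```

A run can clear the overall bar and still fail: one weak axis is enough, which is the point of checking every axis rather than only the average.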

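STAGE 4 · N=100

Holdout + LLM Judge

100 listings
Full scenarios_65 run + 5 holdout cases
LLM Judge 5-point mean and std

Pass: LLM Judge ≥ 4.0 · std ≤ 0.6

If a stage does not pass, the listing count must not be increased. On a regression, roll back to the previous stage and check the weights and the embedding text first.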
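The progression rule is small enough to state as code. The stage sizes come from this document; `next_stage` and its arguments are hypothetical.

```python
STAGES = (5, 20, 50, 100)


def next_stage(current_n: int, passed: bool, regressed: bool = False) -> int:
    """Advance only on a pass; step back one stage on a regression."""
    index = STAGES.index(current_n)
    if regressed:
        return STAGES[max(index - 1, 0)]
    if passed:
        return STAGES[min(index + 1, len(STAGES) - 1)]
    # Gate failed without a regression: stay at the same listing count.
    return current_n
```

A regression takes priority over a pass, so a run that passes the current gate but regresses on an earlier metric still rolls back.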

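7. Risks / mitigations

Risk	Severity	Mitigation
Cohere Rerank response latency (external API)	MED	Cap top-K at 50 + Redis cache (query_hash, 5 min) + fallback (local BGE-Reranker)
Gemini embedding cost and rate limits	MED	24h embedding cache (already applied) + a single call at listing registration time
Regression on real addresses when training on synthetic data only	HIGH	Cross-validate with 10 real listings from Stage 3 onward
Variance in LLM slot extraction (reproducibility)	MED	temperature=0.0 + fixed seed + std tracking during Judge evaluation
Dependency leakage between the POC and the main backend	LOW	Add an import-linter rule: `poc-chat-search` ↛ `backend` (forbidden)
Weight overfitting (hand-tuning for specific scenarios)	HIGH	Evaluate the 5 holdout cases only once, right before submission + cross-validate on the 60 train cases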
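The rerank mitigation (cache keyed by a query hash, 5-minute TTL, local fallback) can be sketched without any infrastructure. An in-memory dict stands in for Redis here; `CachedReranker`, `query_hash`, and the decision to include the candidate documents in the hash are assumptions.

```python
import hashlib
import time
from typing import Callable, Sequence

Reranker = Callable[[str, Sequence[str]], list[str]]


def query_hash(query: str, docs: Sequence[str]) -> str:
    """Stable cache key over the query and its candidate documents."""
    digest = hashlib.sha256(query.encode("utf-8"))
    for doc in docs:
        digest.update(b"\x00")  # separator so ["ab"] and ["a", "b"] differ
        digest.update(doc.encode("utf-8"))
    return digest.hexdigest()


class CachedReranker:
    """Primary reranker with a TTL cache and a fallback on failure."""

    def __init__(self, primary: Reranker, fallback: Reranker,
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._primary = primary
        self._fallback = fallback
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: dict[str, tuple[float, list[str]]] = {}

    def rerank(self, query: str, docs: Sequence[str]) -> list[str]:
        key = query_hash(query, docs)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return list(hit[1])
        try:
            result = self._primary(query, docs)
        except Exception:
            # External API failed or timed out: answer from the local
            # fallback and skip caching so the primary is retried next time.
            return self._fallback(query, docs)
        self._cache[key] = (now, list(result))
        return result
```

The injectable `clock` exists only to make expiry testable; in production the TTL would be enforced by Redis itself.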

8. Run now: start Stage 1

cd poc-chat-search
cp .env.example .env                       # fill in OPENROUTER_API_KEY and GROQ_API_KEY before starting the containers
docker compose up -d                       # PG 30432 / Redis 30379 / Qdrant 30333 / API 30811
python scripts/init_db.py --reset --n=5    # Stage 1 seed (5 listings)

# Verify
curl -X POST http://localhost:30811/v2/ai/chat \
  -H "Content-Type: application/json" \
  -d '{"user_id":"smoke","thread_id":"t1","user_message":{"content":"강남 투룸 6000/80"},"variant":"full"}'

# Frontend
echo "VITE_POC_API_URL=http://localhost:30811" >> frontend/.env.local
cd frontend && npm run dev   # test utterances on the /chat-v2 page

Reference: assets already implemented

src/domain/listing_text.py:build_indexable_text - already applies this design's lane separation (excludes price, room count, and English enums)
src/domain/enums.py - single source of truth (PROPERTY_TYPES, PERSONA_TAGS, CLARITY, URGENCIES, etc.)
migrations/001_init.sql - enforced enums + persona_rules / area_market_stats / areas tables
src/application/listing_registrar.py - converts raw → enriched through LLM structured output (zero regexes)
tests/scenarios_65.py + llm_judge.py - 65 scenarios + automatic scoring
src/adapters/embedding/openrouter_gemini_adapter.py + reranker/openrouter_cohere_adapter.py