Install
openclaw skills install civil-judgment-taiwan-vectorstore

Ingest Taiwan civil court judgments (HTML or PDF) into Qdrant with Ollama embeddings, preserving traceability, deduplication, and incremental updates.

Scope: Taiwan civil court judgments only (民事判決). This skill ingests Taiwan civil case files (HTML or PDF) into Qdrant. All parsing, chunking, and embedding logic lives in scripts/ingest.py — your job is to run the script, not to reimplement the pipeline.
source {baseDir}/.venv/bin/activate
The user will provide an absolute path to a run folder.
Example: /path/to/output/judicialyuan/20260305_142030
Verify it exists and has HTML or PDF files:
ls <RUN_FOLDER>/archive/ | grep -E '\.(html|pdf)$' | head -5
If no archive/*.html or archive/*.pdf files → stop and tell the user the folder has no ingestible data.
Use absolute paths throughout — no cd needed:
python3 {baseDir}/scripts/ingest.py \
--run-folder <RUN_FOLDER>
The script handles everything: pre-flight checks, collection auto-creation (creates civil_case_doc / civil_case_chunk if they don't exist), canonicalization, chunking, embedding, Qdrant upsert, manifest + report writing.
Re-running the same command on the same folder is always safe — deterministic IDs mean upsert = overwrite. No special --resume flag needed; just run the same command again.
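Why upsert equals overwrite can be shown with a minimal sketch. The function name `point_id`, the use of `uuid5` with `NAMESPACE_URL`, and the `doc_url`/chunk-index inputs are assumptions for illustration only; the real logic lives in scripts/ingest.py and must not be changed.

```python
import uuid

# Hypothetical sketch of deterministic point IDs: the same input always
# maps to the same UUID, so a Qdrant upsert overwrites rather than duplicates.
def point_id(doc_url, chunk_index=None):
    name = doc_url if chunk_index is None else f"{doc_url}#chunk-{chunk_index}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, name))

a = point_id("https://judgment.judicial.gov.tw/FJUD/x?id=1")
b = point_id("https://judgment.judicial.gov.tw/FJUD/x?id=1")
assert a == b  # re-running produces identical IDs, so upsert = overwrite
```

Because the ID is a pure function of the input, re-running the pipeline writes the same points to the same IDs instead of accumulating duplicates.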
Successful output looks like:
OK files=42 processed=42 skipped=0 errored=0 doc_points=42 chunk_points=187
manifest=<RUN_FOLDER>/ingest_manifest.jsonl
report=<RUN_FOLDER>/ingest_report.md
Read the report (human-readable stats summary):
cat <RUN_FOLDER>/ingest_report.md
If there are errors, check the manifest (machine-readable, one JSON line per file) for per-file diagnosis:
grep -E '"status":"(skipped|error|partial)"' <RUN_FOLDER>/ingest_manifest.jsonl
Tell the user:
the number of documents ingested (doc_points), the number of chunks ingested (chunk_points), and that ingestion is done. Do not proceed to additional steps unless the user asks.
- ingest.py handles all of this.
- Do not set verify=False or skip SSL verification for any HTTP request.
- Do not modify files under archive/. Raw HTML is the immutable source of truth.
- Do not change chunking parameters (--max-chars, --overlap-chars) unless the user explicitly asks.
- Every point's metadata carries traceability (doc_url + local_path) and parser_version. Current: v3.5-sentence-boundary.

PREFLIGHT_FAILED: Qdrant not reachable
Qdrant is down or unreachable at the default/configured URL.
# Check if Qdrant is running
curl -s http://localhost:6333/collections | head -1
# If not running, start it (or ask the user)
PREFLIGHT_FAILED: Ollama not reachable
# Check Ollama
curl -s http://localhost:11434/api/tags | head -5
PREFLIGHT_FAILED: Ollama model missing: bge-m3:latest
ollama pull bge-m3:latest
Then re-run Step 3.
PREFLIGHT_FAILED: No archive/*.html or archive/*.pdf found
The run folder exists but has no archived detail pages. Check that the path points at the correct run folder and that archive/ is not empty.
skipped > 0 or errored > 0
Check ingest_manifest.jsonl for per-file details:
grep -E '"status":"(skipped|error|partial)"' "<RUN_FOLDER>/ingest_manifest.jsonl"
| Manifest status | Meaning | Action |
|---|---|---|
| ok | Doc + all chunks ingested | None |
| partial | Doc upserted, but some section chunks failed embedding | Check Ollama stability; can re-run safely |
| skipped | Doc-level embedding failed — nothing upserted for this doc | Check Ollama; re-run safely |
| error | HTML read/parse failed | Check if the HTML file is corrupted |
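The same triage can be scripted. A minimal sketch that tallies manifest statuses in Python — the `status` and `file` field names are assumptions about the manifest schema, and the sample lines below are fabricated for illustration:

```python
import json
from collections import Counter

def tally(manifest_lines):
    """Count statuses across manifest lines and collect non-ok records."""
    counts, bad = Counter(), []
    for line in manifest_lines:
        rec = json.loads(line)
        counts[rec["status"]] += 1
        if rec["status"] != "ok":
            bad.append(rec)
    return counts, bad

# Sample records standing in for <RUN_FOLDER>/ingest_manifest.jsonl lines.
sample = [
    '{"file": "fjud_detail_001.html", "status": "ok"}',
    '{"file": "fjud_detail_002.html", "status": "partial"}',
]
counts, bad = tally(sample)
print(dict(counts), [r["file"] for r in bad])
```

In practice you would pass `open("<RUN_FOLDER>/ingest_manifest.jsonl")` instead of `sample`; anything in `bad` maps to a row of the table above.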
Re-running is always safe — use the exact same command. No special flags needed; deterministic IDs → upsert/overwrite.
# Via environment variables
OLLAMA_URL=http://localhost:11434 QDRANT_URL=http://localhost:6333 \
python3 scripts/ingest.py --run-folder "..."
# Via CLI flags (take precedence over env vars)
python3 scripts/ingest.py --run-folder "..." \
--ollama http://localhost:11434 --qdrant http://localhost:6333
Default endpoints:
| Service | Default | Env override |
|---|---|---|
| Ollama | http://localhost:11434 | $OLLAMA_URL |
| Qdrant | http://localhost:6333 | $QDRANT_URL |
python3 scripts/ingest.py --run-folder "..." --limit 5
<run_folder>/
archive/
fjud_detail_001.html ← HTML input
fjud_detail_002.html
fjud_detail_003.pdf ← PDF input (also supported)
fint_detail_001.html (if system=both)
results_fjud.jsonl (optional)
results_fint.jsonl (optional)
The script discovers all archive/*.html and archive/*.pdf files automatically (sorted by filename). HTML and PDF files can coexist in the same run folder.
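The discovery order described above can be sketched as follows. This is an illustration of the sorted-by-filename behavior, not the script's actual code; the filenames mirror the example layout:

```python
from pathlib import Path
import tempfile

# Build a throwaway run folder mirroring the layout above.
run = Path(tempfile.mkdtemp())
(run / "archive").mkdir()
for name in ["fjud_detail_003.pdf", "fjud_detail_001.html", "fint_detail_001.html"]:
    (run / "archive" / name).touch()

# Discovery sketch: take both extensions, sort lexicographically by filename.
files = sorted(
    p for p in (run / "archive").iterdir() if p.suffix in {".html", ".pdf"}
)
print([p.name for p in files])
# ['fint_detail_001.html', 'fjud_detail_001.html', 'fjud_detail_003.pdf']
```

Note that sorting is purely lexicographic, so HTML and PDF files interleave by name rather than grouping by type — which is also why `--limit N` takes the first N files in this order.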
v1 limitation: The system metadata field is currently hardcoded to FJUD. If a run folder contains both FJUD and FINT files, FINT files will be ingested but mislabeled as FJUD. This does not affect chunking or embeddings — only the system metadata field on the resulting Qdrant points.
python3 scripts/ingest.py --run-folder <PATH> [options]
| Flag | Default | Description |
|---|---|---|
--run-folder | (required) | Path to an input folder |
--ollama | $OLLAMA_URL or http://localhost:11434 | Ollama endpoint |
--qdrant | $QDRANT_URL or http://localhost:6333 | Qdrant endpoint |
--embed-model | bge-m3:latest | Ollama embedding model |
--vector-size | 1024 | Vector dimension |
--max-chars | 900 | Max chars per chunk (500–1000) |
--overlap-chars | 150 | Overlap between chunks (10–20% of max-chars) |
--limit | 0 (no limit) | Process only first N files sorted by filename (lexicographic order); for testing |
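To see how --max-chars and --overlap-chars interact, here is a toy fixed-width illustration. It is not the script's actual algorithm (ingest.py splits on sentence boundaries) and should not be used to reimplement the pipeline:

```python
def naive_chunks(text, max_chars=900, overlap=150):
    """Toy fixed-width chunker: each chunk starts max_chars - overlap after
    the previous one, so consecutive chunks share `overlap` characters.
    The real ingest.py splits on sentence boundaries instead."""
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = naive_chunks("a" * 2000, max_chars=900, overlap=150)
print(len(chunks), [len(c) for c in chunks])
```

With a 2000-character input and the defaults, this yields three chunks of 900, 900, and 500 characters, each pair sharing 150 characters — which is why the table recommends keeping overlap at 10–20% of max-chars.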
Collections: civil_case_doc (1 point/doc), civil_case_chunk (many points/doc). Auto-created if they don't exist.
ingest_report.md: human-readable summary (doc/chunk counts, error counts). Read this first after ingestion.
ingest_manifest.jsonl: machine-readable, one JSON line per doc with status (ok / partial / skipped / error). Read this to diagnose specific file failures (grep for non-ok statuses).
Both files overlap on aggregate counts; the manifest adds per-file detail.
(civil_case_issue collection)
For metadata schema, canonicalization rules, section-splitting patterns, and chunking implementation, see references/internals.md.
Invalid point IDs cause Qdrant to reject the upsert (400 Bad Request). The script uses deterministic UUIDs — do not change the ID generation logic. Chunk the full text, so you are not left with only doc-level points.