---
name: extract-requirements
description: Extracts requirements from indexed SharePoint documents via AI Search MCP, processes them through parallel subagents, and consolidates into a deduplicated hierarchy. Use when processing project documentation into structured requirements. Part of the /project-standup pipeline (Phase 1).
argument-hint: "[path-filter] [optional: --limit N] [optional: --project-dir DIR]"
---

# Requirements Extraction Pipeline

Extract requirements from SharePoint documents indexed in Azure AI Search and consolidate into a structured hierarchy with full source traceability.

## Arguments

- `$ARGUMENTS[0]` — Path filter for documents (e.g., `/SBA_Demo/`, `/ProjectX/Requirements/`). Required.
- `--limit N` — Process only the first N documents (for testing)
- `--project-dir DIR` — Output directory (default: `./extraction-results/`). When called from `/project-standup`, this is `./project-runs/{slug}/`

## Overview

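The pipeline has 4 phases. Each phase writes output files that the next phase reads. This file-based handoff pattern avoids context window bottlenecks and enables parallel processing. Phases 1-3 run in this skill; Phase 4 (ADO work item creation) now runs in the `/create-ado-work-items` skill. Paths shown as `extraction-results/...` assume the default output directory; substitute the `--project-dir` value when one is given.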

```
Phase 1: SCAN & PLAN          → _plan.json
Phase 2: EXTRACT (parallel)   → batch-001.json, batch-002.json, ...
Phase 3: CONSOLIDATE (2-pass) → _hierarchy_plan.json → formatted/*.json
Phase 4: CREATE (ADO)         → work items in ADO
```

---

## Guardrails

- **Never skip Phase 1 (Scan & Plan).** Even for small document sets, the planning phase prevents duplicate extraction and sets up proper batch boundaries.
- **Confirm extraction results with the user before proceeding to hierarchy building.** Phase 2 output should be reviewed for completeness and accuracy.
- **Source references are mandatory.** Every extracted requirement MUST include `source_doc`, `source_url`, and `snippet_range`. These propagate through the entire pipeline and appear on ADO work items.

## Phase 1: SCAN & PLAN

**Goal:** Discover documents, estimate complexity, create a batch plan.

1. Call `search_list_documents` with `path_filter` set to the provided path argument
2. For each document, note: `doc_id`, `doc_name`, `doc_url`, `snippet_count`
3. Categorize documents by size and batch them:
   - **Heavy** (>50 snippets): FDDs, field matrices, large specs → 1 subagent each
   - **Medium** (10-50 snippets): Application forms, guidelines → batch 3-5 per subagent
   - **Light** (<10 snippets): Checklists, letters, templates → batch 8-12 per subagent
4. Write the batch plan to `extraction-results/_plan.json`:
   ```json
   {
     "path_filter": "/SBA_Demo/",
     "total_documents": 5,
     "total_snippets": 300,
     "created_at": "ISO timestamp",
     "batches": [
       {"batch_id": "batch-001", "doc_ids": ["abc"], "doc_names": ["FDD.docx"], "total_snippets": 147, "strategy": "heavy-single"}
     ]
   }
   ```
5. Show the plan to the user and **confirm before proceeding**
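
The size thresholds and batching in step 3 can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the `Doc` class, the function names, and the `medium-batch` / `light-batch` strategy labels are hypothetical (only `heavy-single` appears in the example plan), and the per-batch caps simply take the upper bounds of the ranges given above.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    doc_name: str
    snippet_count: int


def size_class(snippet_count: int) -> str:
    """Heavy is >50 snippets, light is <10, medium is everything in between."""
    if snippet_count > 50:
        return "heavy"
    if snippet_count < 10:
        return "light"
    return "medium"


def plan_batches(docs, medium_per_batch=5, light_per_batch=12):
    """Group documents into batches; the caps are assumed upper bounds from step 3."""
    groups = {"heavy": [], "medium": [], "light": []}
    for doc in docs:
        groups[size_class(doc.snippet_count)].append(doc)

    # Heavy documents get one subagent each.
    chunks = [("heavy-single", [d]) for d in groups["heavy"]]
    for label, cap in (("medium", medium_per_batch), ("light", light_per_batch)):
        members = groups[label]
        for i in range(0, len(members), cap):
            # Strategy labels other than "heavy-single" are hypothetical.
            chunks.append((f"{label}-batch", members[i:i + cap]))

    return [
        {
            "batch_id": f"batch-{n:03d}",
            "doc_ids": [d.doc_id for d in chunk],
            "doc_names": [d.doc_name for d in chunk],
            "total_snippets": sum(d.snippet_count for d in chunk),
            "strategy": strategy,
        }
        for n, (strategy, chunk) in enumerate(chunks, start=1)
    ]


docs = [
    Doc("abc", "FDD.docx", 147),
    Doc("d1", "Form.pdf", 20),
    Doc("d2", "Checklist.docx", 4),
]
batches = plan_batches(docs)
```

The resulting list has the same shape as the `batches` array in `_plan.json`, so it could be written out with `json.dump` alongside the top-level totals.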

---

## Phase 2: EXTRACT (Parallel Subagents)

**Goal:** Read each document's full content and extract every identifiable requirement.

For each batch, spawn a **background subagent** (`run_in_background: true`) with this prompt:

> You are a requirements extraction agent. Your job is to read document content from Azure AI Search and extract ALL requirements that would be needed to build this system in Microsoft Dataverse / Power Platform.
>
> For each document assigned to you:
> 1. Call `search_get_document_snippets` with the doc_id to get the full content
>    - Use `top` and `skip` params to paginate if snippet_count > 50
> 2. Read ALL snippets carefully — do NOT skip any
> 3. Extract every requirement you can identify into the categories below
> 4. **Write the results as JSON** to `extraction-results/{batch_id}.json` using the Write tool
>
> **Documents to process:** [list of doc_ids and names]
>
> **Requirement categories to extract:**
> - **Tables**: New tables/entities needed (name, purpose, key fields mentioned)
> - **Fields**: Specific fields mentioned (field name, data type if stated, which table, required/optional)
> - **Forms**: Form layouts, sections, tabs, field placements mentioned
> - **Views**: List views, filters, columns mentioned
> - **Business Rules**: Validation rules, conditional logic, calculated fields
> - **Security**: Roles, permissions, access restrictions mentioned
> - **Workflows**: Automated processes, notifications, approvals
> - **Dashboards**: Reporting needs, charts, KPIs mentioned
> - **Integrations**: External system connections, data imports/exports
> - **Data Migration**: Existing data that needs to be moved
> - **Documents**: Document management, templates, SharePoint integration
> - **BPFs**: Business process flows, stage-gate processes
>
> **Output JSON schema** (write to `extraction-results/{batch_id}.json`):
> ```json
> {
>   "batch_id": "batch-001",
>   "extracted_at": "ISO timestamp",
>   "documents_processed": [
>     {"doc_id": "...", "doc_name": "...", "doc_url": "...", "snippet_count": N}
>   ],
>   "requirements": [
>     {
>       "id": "REQ-001",
>       "category": "Tables",
>       "title": "Application table",
>       "description": "Track grant/loan applications with status, applicant info, program type",
>       "details": ["Field: Application Number (auto-number)", "Field: Applicant Name (lookup to Account)"],
>       "source_doc": "FDD.docx",
>       "source_url": "https://sharepoint.com/.../FDD.docx",
>       "snippet_range": "snippets 12-15 of 38",
>       "source_context": "Section 3.2 - Application Processing",
>       "confidence": "high|medium|low",
>       "related_requirements": ["REQ-002", "REQ-005"]
>     }
>   ],
>   "cross_references": [
>     {"from_req": "REQ-001", "to_req": "REQ-005", "relationship": "table-has-fields"}
>   ]
> }
> ```
>
> **Rules:**
> - Extract EVERYTHING — it's better to over-extract than miss requirements
> - Use exact names/values from the documents (don't paraphrase field names, table names, option values)
> - Flag confidence level: "high" = explicitly stated, "medium" = strongly implied, "low" = inferred
> - Note cross-references between requirements when you see them
> - For forms: capture tab names, section names, field placements if mentioned
> - For fields: capture data type, required/optional, default values, option set values if mentioned
> - Number requirements sequentially within each batch (REQ-001, REQ-002, etc.)
> - **SOURCE TRACEABILITY:** For EVERY requirement, capture `source_doc` (filename), `source_url` (from doc_url in the search results), and `snippet_range` (e.g., "snippets 12-15 of 38" — the snippet indices you read to extract this requirement). These propagate through all downstream phases and appear as footers on ADO work items.
> - **IMPORTANT:** Write all output to the JSON file. Keep your text response SHORT (just confirm completion + requirement count). Do NOT output the full JSON in your text response — you will hit the output token limit.

**Wait for ALL subagents to complete** before proceeding.
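
Once the subagents have finished, the mandatory source references can be checked mechanically before moving on. The sketch below is one possible check, assuming the batch files follow the output schema in the prompt above; the function names are hypothetical and not part of the skill.

```python
import json
from pathlib import Path

# Fields the guardrails require on every extracted requirement.
REQUIRED_SOURCE_FIELDS = ("source_doc", "source_url", "snippet_range")


def missing_source_fields(batch: dict) -> list[tuple[str, str]]:
    """Return (requirement id, field) pairs where a source field is absent or empty."""
    problems = []
    for req in batch.get("requirements", []):
        for field in REQUIRED_SOURCE_FIELDS:
            if not req.get(field):
                problems.append((req.get("id", "<no id>"), field))
    return problems


def check_output_dir(output_dir: str) -> dict[str, list[tuple[str, str]]]:
    """Check every batch-*.json under output_dir; keys are the files with problems."""
    report = {}
    for path in sorted(Path(output_dir).glob("batch-*.json")):
        batch = json.loads(path.read_text(encoding="utf-8"))
        problems = missing_source_fields(batch)
        if problems:
            report[path.name] = problems
    return report
```

An empty result from `check_output_dir("extraction-results")` would mean every requirement carries all three fields; any non-empty entry points at the batch to re-run.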

### Lessons learned (from pilot testing)
- Large documents (100+ snippets) can produce 80-110 requirements each
- Subagents that try to output the full JSON in text (instead of using the Write tool) will hit the 32K output token limit — the prompt must emphasize writing to file
- Allow ~5-10 minutes per heavy document subagent
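
For heavy documents, the `top` / `skip` pagination mentioned in the extraction prompt amounts to walking the snippet list in fixed-size pages. The helper below only computes the page boundaries; it assumes `skip` is a zero-based offset and `top` is the page size (the usual OData-style meaning), and the page size of 50 is taken from the threshold in the prompt rather than from any documented limit.

```python
def snippet_pages(snippet_count: int, page_size: int = 50):
    """Yield (skip, top) pairs that together cover snippet_count snippets."""
    for skip in range(0, snippet_count, page_size):
        yield skip, min(page_size, snippet_count - skip)


# A 147-snippet document needs three calls.
pages = list(snippet_pages(147))
```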

---

## Phase 3: CONSOLIDATE (Two-Pass)

**Goal:** Deduplicate, cross-reference, build hierarchy, and format stories.

Phase 3 is split into two passes to avoid output token limits on large requirement sets.

### Pass 1: Deduplicate & Build Hierarchy Plan

Spawn a **single background subagent** with this task:

> You are a requirements consolidation agent. Read ALL extracted batch files and produce a hierarchy plan.
>
> **Input files:**
> - Read ALL `extraction-results/batch-*.json` files using Glob + Read
> - Read the `create-ado-work-items` skill at `.claude/skills/create-ado-work-items/SKILL.md` for the standard scaffold Epic/Feature structure
> - Read `.claude/skills/create-ado-work-items/pbi-templates.md` for PBI template patterns
>
> **Process:**
> 1. Read ALL batch JSON files — read the FULL details of every requirement
> 2. **Deduplicate**: Same table/field/form mentioned across multiple documents → merge into one, keeping the richer version, noting both source refs
> 3. **Cross-reference**: Link related requirements (table + its fields + its security + its forms)
> 4. **Build hierarchy**: Group requirements into Epic → Feature → Story structure
>    - Map to standard scaffold Epics (Infrastructure Setup, Security, Documentation, Templates, Enhancements) where applicable
>    - Create new custom Epics/Features for project-specific functional areas
> 5. **Write a SLIM hierarchy plan** (not full HTML stories) to `extraction-results/_hierarchy_plan.json`
>
> **Output schema** (`_hierarchy_plan.json`):
> ```json
> {
>   "project_summary": "Brief description of the project based on all docs",
>   "dedup_stats": {"batch_001_count": N, "batch_002_count": N, "duplicates_found": N, "unique_after_dedup": N, "dedup_notes": "..."},
>   "epics": [
>     {
>       "title": "Epic Name",
>       "is_scaffold": true|false,
>       "features": [
>         {
>           "title": "Feature Name",
>           "stories": [
>             {
>               "title": "Story title (action-oriented)",
>               "category": "Tables|Forms|Views|Business Rules|...",
>               "source_refs": ["batch-001:REQ-001", "batch-002:REQ-015"],
>               "key_details": ["Critical detail 1", "Critical detail 2"]
>             }
>           ]
>         }
>       ]
>     }
>   ]
> }
> ```
>
> **Rules:**
> - Read ALL requirement details to make proper dedup decisions — the slim output is a DECISION MAP, not a summary
> - Do NOT write full HTML descriptions in this pass — just title + source_refs + key_details
> - Merge overlapping requirements: keep the richer version's details, list both source_refs
> - One story per discrete deliverable (don't combine multiple tables or roles)
> - Security items always go under the Security scaffold Epic
> - Infrastructure/environment items under Infrastructure Setup
> - IMPORTANT: Write all output to file. Keep text response SHORT.
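
The merge rule in the prompt above (keep the richer version, list both source refs) can be illustrated with a deliberately simplified sketch. The consolidation agent decides what counts as a duplicate with full context; here, as a stand-in assumption, two requirements are treated as duplicates when their category and normalized title match, and "richer" is approximated by a crude score. All function names are hypothetical.

```python
def richness(req: dict) -> int:
    """Crude proxy for 'richer': detail lines weigh more than description length."""
    return len(req.get("details", [])) * 100 + len(req.get("description", ""))


def dedupe(batches: list[dict]) -> list[dict]:
    """Merge requirements sharing (category, normalized title), keeping all source_refs."""
    merged: dict[tuple[str, str], dict] = {}
    for batch in batches:
        for req in batch["requirements"]:
            # Assumed duplicate key: same category, same title ignoring case and spacing.
            key = (req["category"], " ".join(req["title"].lower().split()))
            ref = f'{batch["batch_id"]}:{req["id"]}'
            entry = merged.get(key)
            if entry is None:
                merged[key] = {"requirement": req, "source_refs": [ref]}
                continue
            entry["source_refs"].append(ref)
            if richness(req) > richness(entry["requirement"]):
                entry["requirement"] = req
    return list(merged.values())
```

Each returned entry carries the surviving requirement plus every `batch:REQ` reference that was folded into it, which is the information a story's `source_refs` array needs.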

### Pass 1.5: Architectural Review (Parallel Review Agents)

After Pass 1 completes and BEFORE formatting, spawn **two parallel review agents** that evaluate the hierarchy plan through different lenses. These reviews are critical because the ADO stories directly drive downstream app development: poorly scoped or structured requirements result in a poorly built app.

**Agent 1: App Architect Review**

> You are a Power Platform / Model-Driven App architect. Review the hierarchy plan for app design quality.
>
> **Read:** `extraction-results/_hierarchy_plan.json` + all `extraction-results/batch-*.json` for full context
>
> **Evaluate and report:**
> - Is the form/view design clean and functional from an end-user perspective?
> - Are navigation patterns logical (sitemap grouping, drill-down flows)?
> - Are there missing UX requirements (search config, quick view forms, subgrids)?
> - Do the business rules make sense from a user workflow perspective?
> - Are there redundant or overlapping forms/views that should be consolidated?
> - Are dashboard/reporting stories properly scoped for the target platform (Dataverse dashboards vs Power BI)?
> - Are BPF stages aligned with actual user workflows?
>
> **Output:** Write recommendations to `extraction-results/_review_app_architect.json` with:
> - `approved_stories`: stories that are well-scoped (no changes needed)
> - `modify_stories`: stories needing scope/description changes (with specific suggestions)
> - `add_stories`: missing stories that should be added
> - `remove_stories`: stories that are redundant or out of scope
> - `merge_stories`: stories that should be combined

**Agent 2: Database Architect Review**

> You are a Dataverse / relational database architect. Review the hierarchy plan for data model quality.
>
> **Read:** `extraction-results/_hierarchy_plan.json` + all `extraction-results/batch-*.json` for full context
>
> **Evaluate and report:**
> - Are table designs normalized appropriately (not over/under-normalized)?
> - Are relationships correct and efficient (1:N vs N:N, cascade rules)?
> - Are field types appropriate (string lengths, decimal precision, choice vs lookup)?
> - Are there missing indexes, alternate keys, or dedup rules needed?
> - Are naming conventions consistent across tables/fields?
> - Are option sets reused where appropriate (global vs local choices)?
> - Is the data model extensible for future requirements?
> - Are there performance concerns (too many columns on one table, missing views for common queries)?
>
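> **Output:** Write recommendations to `extraction-results/_review_db_architect.json` with the same keys as the App Architect review: `approved_stories`, `modify_stories`, `add_stories`, `remove_stories`, `merge_stories`.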

**After both reviews complete:**
1. Read both review files
2. Apply approved modifications to `_hierarchy_plan.json` (update titles, descriptions, merge/split stories)
3. Add any new stories identified by reviewers
4. Remove any stories flagged as redundant
5. Show the review summary + changes to the user and **confirm before proceeding to Pass 2**
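
For the review summary in step 5, a per-file count of each recommendation type is often enough to start the conversation. The sketch below assumes each of the five keys holds a JSON array; the shape of the individual entries is not specified here, so the code only counts them. The function name is hypothetical.

```python
import json
from pathlib import Path

REVIEW_FILES = ("_review_app_architect.json", "_review_db_architect.json")
REVIEW_KEYS = (
    "approved_stories",
    "modify_stories",
    "add_stories",
    "remove_stories",
    "merge_stories",
)


def review_summary(output_dir: str) -> dict[str, dict[str, int]]:
    """Count recommendations per key for each review file (assumes list values)."""
    summary = {}
    for name in REVIEW_FILES:
        review = json.loads((Path(output_dir) / name).read_text(encoding="utf-8"))
        summary[name] = {key: len(review.get(key, [])) for key in REVIEW_KEYS}
    return summary
```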

### Pass 2: Format Stories (Parallel Per-Epic Agents)

After Pass 1.5 review completes, spawn **parallel background subagents** — one per Epic or group of related Epics — to generate full HTML-formatted stories.

**Agent grouping strategy** (balance workload across 3-5 agents):
- **Agent A**: All scaffold Epics (typically 10-20 stories)
- **Agent B**: Largest custom Epic (e.g., Core Data Model)
- **Agent C**: Mid-size Epics grouped (e.g., Processing + Forms + BPFs)
- **Agent D**: Remaining Epics grouped (e.g., Integration + Views + Dashboards)
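
When the Epic sizes are known from `_hierarchy_plan.json`, the workload balancing can also be done mechanically. The sketch below uses a generic greedy heuristic (largest Epic first, always to the least-loaded agent), which is a swapped-in technique rather than the grouping described above: it ignores the rule that scaffold Epics stay together, and the function name and example counts are made up for illustration.

```python
import heapq


def group_epics(story_counts: dict[str, int], agent_count: int = 4) -> list[list[str]]:
    """Assign each Epic, largest first, to the agent with the fewest stories so far."""
    heap = [(0, i) for i in range(agent_count)]  # (stories assigned, agent index)
    groups: list[list[str]] = [[] for _ in range(agent_count)]
    for title, count in sorted(story_counts.items(), key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)
        groups[i].append(title)
        heapq.heappush(heap, (load + count, i))
    return groups


# Hypothetical story counts per Epic.
counts = {
    "Core Data Model": 60,
    "Security": 15,
    "Forms": 30,
    "Views": 25,
    "Integration": 20,
    "Dashboards": 10,
}
groups = group_epics(counts, agent_count=3)
```

With these example counts the three agents end up with 60, 55, and 45 stories respectively.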

Each agent's prompt:

> You are a story formatting agent. Generate fully formatted ADO User Stories with proper HTML.
>
> **Your assigned Epics:** [list epic titles]
>
> **Input files to read:**
> 1. `extraction-results/_hierarchy_plan.json` — Story titles, source_refs, key_details
> 2. `extraction-results/batch-*.json` — Full requirement details (look up each source_ref)
> 3. `.claude/skills/create-ado-work-items/pbi-templates.md` — PBI template patterns to follow
>
> **Process:**
> 1. Read the hierarchy plan to get your assigned stories
> 2. For each story, look up the FULL requirement details from the batch files using the source_refs
> 3. Match to PBI template patterns where applicable (Create New Table, Create Security Role, etc.)
> 4. Generate full HTML description and acceptance criteria
> 5. Write output to `extraction-results/formatted/{epic-slug}.json`
>
> **HTML Formatting Standards:**
>
> Story Description:
> ```html
> <b>AS A</b> [role/persona]<br>
> <b>I WANT</b> [what the user needs]<br>
> <b>SO THAT</b> [business benefit]<br>
> <br>
> <b>Implementation Details:</b><br>
> <ul>
> <li>Specific detail with exact field names, table names, values from requirements</li>
> </ul>
> ```
>
> Acceptance Criteria:
> ```html
> <b>GIVEN</b> [precondition or context]<br>
> <b>WHEN</b> [specific action]<br>
> <b>THEN</b> [expected result]<br>
> <br>
> <b>GIVEN</b> [another scenario]<br>
> <b>WHEN</b> [another action]<br>
> <b>THEN</b> [another result]
> ```
>
> **Rules:**
> - Use EXACT names/values from source requirements (don't paraphrase)
> - Match PBI template patterns when applicable
> - Include 2-4 GIVEN/WHEN/THEN scenarios per story
> - Write output to file. Keep text response SHORT.
>
> **Output schema** (`extraction-results/formatted/{slug}.json`):
> ```json
> {
>   "formatted_at": "ISO timestamp",
>   "epics": [
>     {
>       "title": "Epic Name",
>       "is_scaffold": true|false,
>       "features": [
>         {
>           "title": "Feature Name",
>           "stories": [
>             {
>               "title": "Story title",
>               "description_html": "full HTML",
>               "acceptance_criteria_html": "full HTML",
>               "source_requirements": ["batch-001:REQ-001"]
>             }
>           ]
>         }
>       ]
>     }
>   ]
> }
> ```

**Wait for ALL formatting agents to complete.**

Show the hierarchy summary to the user (epic/feature/story counts) and **confirm before proceeding to Phase 4**.
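
The epic/feature/story counts for that summary can be read straight from the formatted files. A minimal sketch, assuming the files follow the Pass 2 output schema; the function names are hypothetical.

```python
import json
from pathlib import Path


def count_hierarchy(doc: dict) -> dict[str, int]:
    """Count epics, features, and stories in one formatted (or hierarchy-plan) document."""
    epics = doc.get("epics", [])
    features = [f for e in epics for f in e.get("features", [])]
    stories = [s for f in features for s in f.get("stories", [])]
    return {"epics": len(epics), "features": len(features), "stories": len(stories)}


def summarize_formatted(output_dir: str) -> dict[str, int]:
    """Sum the counts over every formatted/*.json file under output_dir."""
    totals = {"epics": 0, "features": 0, "stories": 0}
    for path in sorted((Path(output_dir) / "formatted").glob("*.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))
        for key, value in count_hierarchy(doc).items():
            totals[key] += value
    return totals
```

Because `_hierarchy_plan.json` uses the same epics/features/stories nesting, `count_hierarchy` can be run on it as well to confirm that no stories were lost between Pass 1 and Pass 2.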

### Lessons learned (from pilot testing)
- Pass 1 on ~200 requirements across 2 documents found 62 duplicates and took ~30 minutes
- Pass 2 with 4 parallel agents formatted 173 stories in ~8-10 minutes
- The two-pass split is essential: a single agent trying to write full HTML for 170+ stories will exceed the 32K output token limit
- Pass 1 reads ALL details for reasoning (dedup decisions made with full context) — the slim output is just a decision map
- Pass 2 agents re-read original batch files via source_refs to get full details when writing stories
- Agent grouping should aim for 40-70 stories per agent for balanced parallelism
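
The `source_refs` lookup that Pass 2 agents perform can be sketched as an index over the batch files. This assumes refs have the `batch-001:REQ-001` form shown in the hierarchy plan schema and that each batch file records its own `batch_id`; the function names are hypothetical.

```python
import json
from pathlib import Path


def load_requirement_index(output_dir: str) -> dict[str, dict]:
    """Map 'batch-001:REQ-001' style refs to the full requirement objects."""
    index = {}
    for path in sorted(Path(output_dir).glob("batch-*.json")):
        batch = json.loads(path.read_text(encoding="utf-8"))
        for req in batch.get("requirements", []):
            index[f'{batch["batch_id"]}:{req["id"]}'] = req
    return index


def resolve(story: dict, index: dict[str, dict]) -> list[dict]:
    """Return the full requirements behind a story; a dangling ref raises KeyError."""
    return [index[ref] for ref in story["source_refs"]]
```

Letting a dangling reference raise, rather than skipping it silently, surfaces hierarchy-plan entries that point at requirements no batch file contains.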

---

## File Structure

```
{output_dir}/
├── _plan.json                  # Phase 1: batch plan
├── batch-001.json              # Phase 2: per-batch extracted requirements
├── batch-002.json
├── batch-NNN.json
├── _hierarchy_plan.json        # Phase 3 Pass 1: deduplicated hierarchy (slim)
├── _review_app_architect.json  # Phase 3 Pass 1.5: app architecture review
├── _review_db_architect.json   # Phase 3 Pass 1.5: database architecture review
└── formatted/
    └── {epic-slug}.json        # Phase 3 Pass 2: HTML-formatted stories
```

**Note:** ADO work item creation (formerly Phase 4 in this skill) has moved to the `/create-ado-work-items` skill, invoked as Phase 3 of the `/project-standup` pipeline. The full architecture review (Phase 2 of pipeline) happens via `/architect-app` between extraction and ADO creation.

---

## Timing Estimates (from pilot: 2 heavy docs, ~260 snippets)

| Phase | Duration | Notes |
|-------|----------|-------|
| Phase 1: Scan & Plan | ~1 minute | Single MCP call + categorization |
| Phase 2: Extract | ~5-10 min | Parallel subagents, depends on doc count/size |
| Phase 3 Pass 1: Consolidate | ~20-30 min | Single agent reads all, deduplicates, builds hierarchy |
| Phase 3 Pass 2: Format | ~8-10 min | 4 parallel agents writing full HTML stories |
| Phase 4: Create | ~15-30 min | Now runs in `/create-ado-work-items`; depends on story count and ADO API rate limits |

---

## Notes

- For large document sets (50+ docs), Phase 2 may need 10+ parallel agents
- Phase 3 Pass 1 is the bottleneck — for very large projects (500+ requirements), consider splitting the consolidator by category first, then cross-referencing
- Always review the hierarchy summary before Phase 4 — this is the last checkpoint before ADO writes
- The `extraction-results/` directory should be in `.gitignore` as it contains transient processing artifacts
- Subagents MUST write output to files (not text responses) to avoid the 32K output token limit

---

## Pipeline Integration

This skill is **Phase 1** of the `/project-standup` pipeline:

```
Phase 1: EXTRACT (this skill) → _hierarchy_plan.json
Phase 2: ARCHITECT (/architect-app) → build-specification.json
Phase 3: CREATE ADO (/create-ado-work-items) → ado-work-items.json
Phase 4: BUILD APP (/build-dataverse-app) → build-log.json
Phase 5: BUILD FLOWS (/build-power-automate-flows) → build-log.json
Phase 6: VERIFY & CLOSE → standup-report.md
```

The output of this skill (`_hierarchy_plan.json` + `batch-*.json` files) feeds into the `/architect-app` skill, which runs a multi-agent architecture review before any ADO items are created.
