# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.4.2] - 2026-06-02

### Fixed

- **Paywall bypass on mid-page paywall markers** -- The 16KB head sample in `detectPaywall()` was too small for raw HTML from the bypass chain (e.g. the macropolis.gr Googlebot response has the paywall curtain at position ~16,800). Replaced with a three-window scan: 16KB head + 4KB tail + full text on pages >20KB.
- **Reject still-paywalled bypass results** -- `pullPageEnhanced` now checks `bypassed.paywall.paywalled` before accepting a bypassed response. When Googlebot still serves the paywall or Playwright is not installed, the user sees an honest bypass-strategies-exhausted notice instead of a misleading 100% clean success.
- **B2B and analysis-site paywall markers** -- Added 19 new high-weight text markers covering macropolis.gr and similar EU-policy sites.

### Added

- **Chromium output detection** -- `detectChromiumError()` recognizes the errors Chromium emits when running without the `--no-sandbox` flag, so they are not mistaken for paywall markers.
- **57 new unit tests** covering deep-marker detection, tail detection, bypass safety, and large-page scanning.

## [0.3.7] - 2026-05-22

### Fixed

- **arXiv vertical extractor URL coverage** -- `matchesArxiv()` now matches `export.arxiv.org/api/query?...` and `arxiv.org/pdf/...` URLs in addition to the `arxiv.org/abs/...` pattern, which was previously the only one matched. `extractArxiv()` extracts paper IDs from all three URL formats and reuses the original API response when the input is already the query endpoint (avoiding a redundant second HTTP request). Previously, `api/query` URLs fell through to generic XML extraction, producing garbled results. Added 7 unit tests for the new patterns.

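To illustrate the three URL shapes covered by the arXiv fix above, here is a minimal sketch of ID extraction. It is not the project's actual `matchesArxiv()` / `extractArxiv()` code: `parseArxivUrl`, `ArxivMatch`, and `ID_PATTERN` are hypothetical names, and the sketch assumes that query-endpoint URLs carry their IDs in the arXiv API's `id_list` parameter (URLs that use only `search_query` return `null` here).

```typescript
// Hypothetical sketch; names and ID regex are illustrative assumptions,
// not the project's real implementation.
type ArxivUrlKind = "abs" | "pdf" | "api";

interface ArxivMatch {
  kind: ArxivUrlKind;
  ids: string[];
}

// New-style IDs (2301.01234, optional version suffix) and
// old-style IDs (hep-th/9901001, math.GT/0309136).
const ID_PATTERN = /^(\d{4}\.\d{4,5}|[a-z-]+(\.[A-Z]{2})?\/\d{7})(v\d+)?$/;

function parseArxivUrl(input: string): ArxivMatch | null {
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return null; // not a parseable absolute URL
  }
  const host = url.hostname.replace(/^www\./, "");

  // Query endpoint: IDs arrive as a comma-separated id_list parameter.
  if (host === "export.arxiv.org" && url.pathname === "/api/query") {
    const raw = url.searchParams.get("id_list") ?? "";
    const ids = raw
      .split(",")
      .map((s) => s.trim())
      .filter((s) => ID_PATTERN.test(s));
    return ids.length > 0 ? { kind: "api", ids } : null;
  }

  if (host !== "arxiv.org") return null;

  // /abs/<id> or /pdf/<id>, with an optional trailing ".pdf".
  const m = url.pathname.match(/^\/(abs|pdf)\/(.+?)(\.pdf)?$/);
  if (!m) return null;
  const id = m[2];
  return ID_PATTERN.test(id) ? { kind: m[1] as ArxivUrlKind, ids: [id] } : null;
}
```

A caller could branch on `kind === "api"` to reuse the response it has already fetched rather than issuing a second request, which is the behaviour the entry above describes.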
### Added

- **`Sec-Ch-Ua` client hint headers** -- `buildHeaders()` now includes User-Agent Client Hints (`Sec-Ch-Ua`, `Sec-Ch-Ua-Mobile`, `Sec-Ch-Ua-Platform`) alongside existing Fetch Metadata headers. Completes the Chrome 120 browser fingerprint for better anti-bot resistance.
- **`isLikelyJSRendered()` heuristic** -- Detects SPA shell pages by checking if body text is <500 chars but has >3 `