[-] [email protected] 1 points 13 hours ago

@xyro Ah, I see! I’m not using Ollama at the moment — my setup is based on GPT4All with a locally hosted DeepSeek model, which handles the semantic parsing directly.
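
For context, calling the model through GPT4All's Python bindings looks roughly like this (a simplified sketch; the model filename is a placeholder for whatever DeepSeek GGUF build you have locally):

# Sketch: feeding extracted document text to a local DeepSeek model via GPT4All.
# "deepseek-model.gguf" is a placeholder filename, not a real model name.
from gpt4all import GPT4All

document_text = open("document.txt", encoding="utf-8").read()  # text extracted from a PDF

model = GPT4All("deepseek-model.gguf")

with model.chat_session():
    summary = model.generate(
        "Analyze this Executive Order document:\n"
        "- Purpose: 1-2 sentences\n"
        "- Key provisions: 3-5 bullet points\n\n" + document_text,
        max_tokens=512,
    )

print(summary)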

As mentioned earlier, the pipeline doesn’t just diff pages — it detects new document URLs from the source feed (via selectors), downloads them, and generates structured summaries. Here's a snippet from the YAML config to illustrate how that works:

extract:
  events:
    selector: "results[*]"
    fields:
      url: pdf_url
      title: title
      order_number: executive_order_number

download:
  extensions: [".pdf"]

gpt:
  prompt: |
    Analyze this Executive Order document:
    - Purpose: 1–2 sentences
    - Key provisions: 3–5 bullet points
    - Agencies involved: list
    - Revokes/amends: if any
    - Policy impact: neutral analysis
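
Under the hood, detecting new documents from that feed boils down to something like this (my sketch; it assumes the selector is JSONPath-style, and the feed URL is invented for illustration):

# Sketch: pull document entries out of a JSON feed with a JSONPath-style
# selector and keep only URLs not seen before. Requires: requests, jsonpath-ng.
# The feed URL is a placeholder.
import requests
from jsonpath_ng import parse

FEED_URL = "https://example.gov/api/executive-orders"
SELECTOR = parse("results[*]")
seen = set()  # persisted between runs in the real pipeline

def detect_new_documents():
    data = requests.get(FEED_URL, timeout=30).json()
    new = []
    for match in SELECTOR.find(data):
        event = match.value
        url = event.get("pdf_url")
        if url and url not in seen:
            seen.add(url)
            new.append({
                "url": url,
                "title": event.get("title"),
                "order_number": event.get("executive_order_number"),
            })
    return new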

To keep things efficient, I also support regex-based extraction before passing content to the LLM. That way, I can isolate relevant blocks (e.g. addresses, client names, conclusions) and reduce the noise in the prompt. Example from another config:

processing:
  extract_regex:
    - "object of cultural heritage"
    - "address[:\\s]\\s*(.{10,100}?)(?=\\n|$)"
    - "project(?:s)?"
    - "circumstances"
    - "client\\s*:?\\s*(.{10,100}?)(?=\\n|$)"
    - "(?:conclusions?)\\s*(.{50,300}?)(?=\\n|$)"

Let me know if you're experimenting with similar flows — I’d be happy to share templates or compare how DeepSeek performs on your sources!

[-] [email protected] 1 points 23 hours ago

Hello! For changedetection.io there are setup instructions for installing via pip: https://github.com/dgtlmoon/changedetection.io/wiki/Microsoft-Windows

What is your use case?

[-] [email protected] 1 points 1 day ago

@xyro Thanks for sharing your case! I’ve also tested changedetection.io — it’s a great tool for basic site monitoring.

But in my tests, it doesn’t go beyond the surface. If there’s a page with multiple document links, it’ll detect changes in the list (via diff), but it won’t automatically download and analyze the new documents themselves.

Here’s how I’ve approached this (rough code sketch after the list):

  1. Crawl the page to extract links
  2. Detect new document URLs
  3. Download each document and extract keywords
  4. Generate an AI summary using a local LLM
  5. Add the result to a readable feed
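
A condensed sketch of those five steps is below. Everything here is illustrative: the page URL and model filename are placeholders, and PDF text extraction via pypdf is my assumption rather than the pipeline's exact internals.

# Sketch of the five steps: crawl -> detect new URLs -> download + extract
# text -> local-LLM summary -> append to a feed file.
# Requires: requests, beautifulsoup4, pypdf, gpt4all. All URLs/filenames are placeholders.
import io, json, pathlib
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader
from gpt4all import GPT4All

PAGE = "https://example.gov/documents"
SEEN_FILE = pathlib.Path("seen_urls.json")
FEED_FILE = pathlib.Path("feed.md")

def crawl_pdf_links(page_url):
    # Step 1: crawl the page and collect absolute links to PDF documents
    soup = BeautifulSoup(requests.get(page_url, timeout=30).text, "html.parser")
    return {urljoin(page_url, a["href"]) for a in soup.select("a[href$='.pdf']")}

def pdf_to_text(url):
    # Step 3: download the document and extract its text page by page
    raw = requests.get(url, timeout=60).content
    return "\n".join(page.extract_text() or "" for page in PdfReader(io.BytesIO(raw)).pages)

model = GPT4All("deepseek-model.gguf")  # placeholder model file
seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()

for url in crawl_pdf_links(PAGE) - seen:          # Step 2: keep only new URLs
    text = pdf_to_text(url)
    summary = model.generate(                     # Step 4: local-LLM summary
        "Summarize this document in 3-5 bullet points:\n\n" + text[:8000],
        max_tokens=400,
    )
    with FEED_FILE.open("a") as feed:             # Step 5: append to the feed
        feed.write(f"## {url}\n\n{summary}\n\n")
    seen.add(url)

SEEN_FILE.write_text(json.dumps(sorted(seen)))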

P.S. If it helps, I can create a YAML template tailored to your grant-tracking case and run a quick test.

29
submitted 1 day ago by [email protected] to c/[email protected]

Hello! I'm evaluating tools to track changes in:

  • Government/legal PDFs (new regulations, court rulings)
  • News sites without reliable RSS
  • Tender portals
  • Property management messages (e.g. service notices)
  • Bank terms and policy updates

Current options I've tried:
  • Huginn — powerful, but requires significant setup and has no unified feed
  • changedetection.io — good for HTML, limited for documents

Key needs:
✓ Local processing (no cloud dependencies)
✓ Multi-page PDF support
✓ Customizable alert rules
✓ Minimal manual monitoring overhead (robust, offline-first approaches preferred)

What's working well for others? Especially interested in:

  1. Solutions combining OCR + text analysis
  2. Experience with local LLMs for this (NLP, not just diff)
  3. Creative workarounds you've built

(P.S. Testing a deep scraping + LLM pipeline — if results look promising, will share.)
