Global News Publishers Actor

Ultimate News Scraper - Rise of the Phoenix

Extract real-time and historical article data from a large catalog of global news publishers with category targeting and clean JSON output.

Best for

  • News monitoring
  • Market and competitor intelligence
  • SEO and content research
  • Publisher coverage tracking

Fields

  • Site name
  • Country
  • Region
  • Language
  • Article title
  • Author
  • Article body

Inputs

  • Websites to scrape
  • Categories to scrape
  • Execution mode
  • Historic cutoff date
  • Max items per site

README

Ultimate News Scraper - Rise of the Phoenix technical notes

Ultimate News Scraper - Rise of the Phoenix is an Apify Actor for real-time and historical article extraction across 800+ global publishers. It supports category targeting, proxy configuration, and fallback crawling using Scrapling, PyDoll, and Selenium. The workflow is useful when a team needs structured news article data for monitoring, research, market intelligence, SEO analysis, or internal reporting.

Use Cases

  • News monitoring
  • Market and competitor intelligence
  • SEO and content research
  • Publisher coverage tracking
  • Historical article collection

Data Fields

  • Site name
  • Country
  • Region
  • Language
  • Article title
  • Author
  • Article body
  • Tags
  • Published date
  • Article URL
  • URL hash
  • Main image URL
  • SEO description
  • Scraped at timestamp
  • Scraping tool
  • Execution mode
  • Category URL
  • Source HTML language
  • Cutoff filtered flag
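
As a rough sketch of how one record might be modeled downstream, the fields above map onto a typed structure. The snake_case key names and the choice of required fields are assumptions for illustration; the Actor's actual output keys may differ and should be checked against a real dataset item.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class Article:
    """One article record. Key names are hypothetical, not a confirmed schema."""

    site_name: str
    article_url: str
    url_hash: str
    article_title: Optional[str] = None
    author: Optional[str] = None
    article_body: Optional[str] = None
    country: Optional[str] = None
    region: Optional[str] = None
    language: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    published_date: Optional[str] = None
    main_image_url: Optional[str] = None
    seo_description: Optional[str] = None
    scraped_at: Optional[str] = None
    scraping_tool: Optional[str] = None
    execution_mode: Optional[str] = None
    category_url: Optional[str] = None
    source_html_language: Optional[str] = None
    cutoff_filtered: Optional[bool] = None


def article_from_item(item: dict[str, Any]) -> Article:
    """Build an Article from a raw dict, silently ignoring unknown keys."""
    known = {f.name for f in fields(Article)}
    return Article(**{k: v for k, v in item.items() if k in known})
```

Optional fields default to None because some article fields may be unavailable on specific publishers.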

Inputs

  • Websites to scrape
  • Categories to scrape
  • Execution mode
  • Historic cutoff date
  • Max items per site
  • No article limit
  • Proxy configuration
  • Manual site category filters
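
A minimal sketch of assembling these inputs into a run configuration follows. The key names (`websites`, `executionMode`, `historicCutoffDate`, and so on), the mode strings, and the rule that historic mode needs a cutoff date are all assumptions made for this example; the Actor's real input schema is the authority.

```python
from datetime import date
from typing import Any, Optional

VALID_MODES = {"current", "historic"}  # assumed mode names


def build_run_input(
    websites: list[str],
    categories: list[str],
    mode: str = "current",
    cutoff: Optional[date] = None,
    max_items_per_site: Optional[int] = 50,
    proxy_configuration: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Return a run-input dict using hypothetical key names."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if mode == "historic" and cutoff is None:
        # Assumption of this sketch: historic runs always carry a cutoff.
        raise ValueError("historic mode needs a cutoff date")

    run_input: dict[str, Any] = {
        "websites": websites,  # empty list = broader active catalog
        "categories": categories,
        "executionMode": mode,
        # Passing max_items_per_site=None stands in for "No article limit".
        "noArticleLimit": max_items_per_site is None,
    }
    if max_items_per_site is not None:
        run_input["maxItemsPerSite"] = max_items_per_site
    if cutoff is not None:
        run_input["historicCutoffDate"] = cutoff.isoformat()
    if proxy_configuration is not None:
        run_input["proxyConfiguration"] = proxy_configuration
    return run_input
```

Validating the mode and cutoff before a run starts surfaces configuration mistakes early, instead of after a long historic crawl has finished.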

Workflow

  • Selected news publishers or categories
  • Actor run with current or historic mode
  • Clean article JSON dataset
  • Delivery to spreadsheet, database, API, or alerts
  • Monitoring report or intelligence workflow

Delivery

  • CSV
  • Excel
  • Google Sheets
  • API
  • Database
  • Airtable
  • Notion
  • Slack
  • CRM
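
For the spreadsheet-style targets, a small export step is often enough. The sketch below writes a CSV with a short body preview instead of the full text; the column names and the `article_body` key are assumed for illustration and would need to match the real dataset keys.

```python
import csv
import io
from typing import Any, Iterable

# Hypothetical column names; adjust to the actual dataset keys.
CSV_COLUMNS = [
    "site_name",
    "article_title",
    "author",
    "published_date",
    "article_url",
    "body_preview",
]


def to_csv(items: Iterable[dict[str, Any]], preview_chars: int = 200) -> str:
    """Serialize items to CSV text, truncating the body to a preview."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not in CSV_COLUMNS;
    # missing keys are written as empty cells.
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        row = dict(item)
        row["body_preview"] = (item.get("article_body") or "")[:preview_chars]
        writer.writerow(row)
    return buffer.getvalue()
```

The returned string can be written to a file or uploaded wherever the CSV is consumed.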

Limitations

  • Coverage depends on the Actor's active publisher catalog and each publisher's site structure.
  • Some article fields may be unavailable on specific publishers.
  • Historic collection depends on available article archives and the configured cutoff date.
  • Proxy and fallback crawling can improve coverage but do not guarantee every target article will be available.
  • Every workflow is reviewed before setup, so support for a given publisher or field is confirmed case by case.

Setup Notes

  • Choose specific publishers, categories, or the broader active catalog depending on the monitoring goal.
  • Use current mode for fresh coverage and historic mode with a cutoff date for archive collection.
  • Configure proxy and fallback options when publisher coverage or reliability is a concern.

Output Handling

  • Keep source metadata, article title, author, body, publication date, URL, category URL, and scraping mode together.
  • Use URL hash and article URL for deduplication across current and historic runs.
  • Route long article bodies to storage that can handle full text before sending summaries to spreadsheets or alerts.
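
The deduplication step could look roughly like the following. It treats two records as the same article when either the URL hash or the article URL has been seen before, which covers records where one of the two is missing. The keys `url_hash` and `article_url` are assumed names.

```python
from typing import Any, Iterable, Iterator


def dedupe_articles(items: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Yield each article once, keeping the first occurrence."""
    seen: set[tuple[str, str]] = set()
    for item in items:
        # Collect whichever identifiers this record actually has.
        ids = [
            ("hash", item.get("url_hash")),
            ("url", item.get("article_url")),
        ]
        ids = [(kind, value) for kind, value in ids if value]
        if ids and any(identifier in seen for identifier in ids):
            continue  # already emitted under one of its identifiers
        seen.update(ids)
        # Records with no identifier cannot be deduplicated, so they pass through.
        yield item
```

Feeding the current run before the historic run (or the reverse) decides which copy of a repeated article is kept.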

Quality Checks

  • Review publisher coverage before relying on the workflow for monitoring.
  • Check that cutoff filtering behaves as expected in historic mode.
  • Sample articles across multiple publishers to confirm body text, dates, and metadata quality.
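
These checks can be partly automated. The sketch below computes, per publisher, the share of records with a non-empty value for a few fields, and lists records dated before a given cutoff so they can be compared with what the run was expected to return. It assumes hypothetical key names and that `published_date` starts with an ISO 8601 date (YYYY-MM-DD); records that do not parse are skipped.

```python
from collections import defaultdict
from datetime import date
from typing import Any

CHECKED_FIELDS = ("article_title", "author", "article_body", "published_date")


def field_fill_rates(items: list[dict[str, Any]]) -> dict[str, dict[str, float]]:
    """Per site, the fraction of records with a non-empty value per field."""
    totals: dict[str, int] = defaultdict(int)
    filled: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for item in items:
        site = item.get("site_name") or "unknown"
        totals[site] += 1
        for name in CHECKED_FIELDS:
            if item.get(name):
                filled[site][name] += 1
    return {
        site: {name: filled[site][name] / totals[site] for name in CHECKED_FIELDS}
        for site in totals
    }


def published_before(items: list[dict[str, Any]], cutoff: date) -> list[dict[str, Any]]:
    """Return records whose published date parses and is earlier than cutoff."""
    flagged = []
    for item in items:
        raw = item.get("published_date")
        try:
            published = date.fromisoformat(raw[:10])
        except (TypeError, ValueError):
            continue  # missing or non-ISO date; needs manual review instead
        if published < cutoff:
            flagged.append(item)
    return flagged
```

A low fill rate for one publisher points to a field that is unavailable or extracted poorly on that site, which is a cue to sample its articles by hand.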

FAQ

Can this Actor collect both current and historic news?

Yes. The Actor supports current and historic execution modes, with a historic cutoff date available for older article collection.

Can I choose specific publishers?

Yes. The input schema supports selecting one or more websites from the active catalog, or leaving the selection empty to use the broader active catalog.

Can I target specific categories?

Yes. The Actor supports category targeting and manual site category filters for more controlled collection.

What kind of output does it produce?

The output schema includes source metadata, article title, author, body text, tags, publication date, URL fields, image URL, SEO description, scrape timestamp, scraping tool, execution mode, category URL, and cutoff filtering status.

Can The Scrape Lab deliver the data to my tools?

Yes. The Scrape Lab can clean and route the Actor output to CSV, Excel, Google Sheets, APIs, databases, Airtable, Notion, Slack, CRMs, or reporting workflows.

Is every publisher guaranteed to work?

No. Availability depends on publisher structure, public data access, archive availability, and technical complexity. Every workflow is reviewed before setup.

Need data collected or piped somewhere?

Send the source and fields. We'll review the scraper, Actor, or pipeline approach.
