Ultimate Web Novel & Manga Scraper
Institutional-Grade Documentation Edition This repository contains the Ultimate Web Novel & Manga Scraper, a comprehensive WordPress plugin designed to automate the ingestion of manga and web novel content. It is engineered to integrate seamlessly with the Madara theme, transforming a standard WordPress installation into a fully automated content aggregation platform. ---📖 Table of Contents
1. Project Overview 2. Feature Inventory 3. System Requirements 4. Technology Stack 5. Directory Overview 6. Installation 7. Environment Setup 8. Configuration 9. Database Setup 10. Admin & System Usage 11. Development Workflow 12. Production Deployment 13. Security Considerations 14. Limitations & Assumptions 15. Maintenance 16. Licensing ---1. Project Overview
The plugin operates as a "God Object" within the WordPress ecosystem, specifically targeting the Madara manga theme. It acts as a bridge between external content sources (MangaFox, WuxiaWorld, Madara-based sites, etc.) and the local WordPress database. It handles the entire lifecycle of content acquisition:- Scheduling: Cron-based execution.
- Fetching: Multi-mode scraping (cURL, PhantomJS, Puppeteer).
- Processing: HTML parsing, cleaning, and text spinning.
- Translation: Automated translation via Google/DeepL/Microsoft.
- Storage: Saving to local FS, DB, or Cloud Storage (S3).
2. Feature Inventory
- Multi-Source Scraping: Built-in rules for major manga/novel sites.
- Headless Browser Support: Renders JavaScript-heavy sites using PhantomJS or Puppeteer.
- Translation Pipeline: Converts content language on-the-fly.
- Proxy Support: Rotates proxies to bypass IP bans.
- Cloudflare Bypass: Mechanisms to handle anti-bot protection.
- Madara Enhancements: specialized module for cloning other Madara sites via AJAX.
- Auto-Update: Updates existing manga with new chapters automatically.
3. System Requirements
- CMS: WordPress 5.0+
- Theme: Madara (Active)
- Plugin Dependency: Madara Core (
WP_MANGA_STORAGE) - PHP: 7.4+
- Extensions:
curl,dom,mbstring,json,libxml - Optional:
* Node.js (for Puppeteer)
* PhantomJS binary
* shell_exec enabled
4. Technology Stack
- Language: PHP 7/8
- Frontend: jQuery (Admin UI)
- Parsers: PHP Simple HTML DOM Parser, DOMDocument
- Headless: PhantomJS (JS), Puppeteer (Node.js)
- Database: MySQL/MariaDB (WordPress Schema + Madara Custom Tables)
5. Directory Overview
See DIRECTORY_STRUCTURE.md for a complete manifest.root: Core logic (ultimate-manga-scraper.php).includes/: Madara integration classes.res/: Libraries, drivers, and admin UI templates.images/,scripts/,styles/: Assets.
6. Installation
See DEPLOYMENT.md for detailed steps. 1. Upload plugin to/wp-content/plugins/.
2. Activate via WordPress Admin.
3. Ensure Madara theme is active.
7. Environment Setup
- Permissions: Ensure the web server can write to
wp-content/uploadsandwp-content/plugins/ultimate-manga-scraper. - Cron: Disable WP-Cron and setup a system cron for reliability.
8. Configuration
See CONFIGURATION.md. Configuration is handled via Ultimate Web Novel & Manga Scraper -> Main Settings. Key areas:- Headless Settings: Paths to binaries.
- Translation Keys: API credentials.
- Storage Backend: Local vs Cloud.
9. Database Setup
The plugin utilizes the standard WordPresswp_options table for storing rules and settings. Content is stored in wp_posts (Manga) and wp_postmeta. Chapter data is managed by Madara's storage engine.
10. Admin & System Usage
1. Define Rules: Go to the specific scraper tab (e.g., Manga Scraper). 2. Add URL: Paste the TOC URL of the target manga. 3. Set Schedule: Define how often to check for updates. 4. Run: Click "Run This Rule Now" or wait for Cron. 5. Monitor: Watch the "Activity & Logging" tab.11. Development Workflow
- Architecture: See ARCHITECTURE.md.
- Data Flow: See DATA_FLOW.md.
- Modifying: Edits should primarily be made in
ultimate-manga-scraper.phpfor core logic, orincludes/for Madara-specific logic.
12. Production Deployment
- Security: See SECURITY.md.
- Optimization: Use Redis/Memcached object caching. Use a real Cron job.
13. Security Considerations
- SSRF: The plugin makes outbound requests to user-defined URLs.
- RCE:
shell_execis used for headless browsers. Secure your server accordingly. - Access Control: Restrict Admin access.
14. Limitations & Assumptions
- Theme Dependency: Assumes Madara theme structure is present.
- Site Changes: Scrapers rely on DOM structure. Target site changes will break scraping until updated.
- Legal: User is responsible for copyright compliance of scraped content.
15. Maintenance
- Logs: Rotate logs (
auto_clear_logs). - Updates: Check CHANGELOG.md.