https://github.com/html5-ninja/page-replica
html5-ninja / page-replica (Public): Page Replica - Tool for Web Scraping, Prerendering, and SEO Boost. MIT license. 128 stars, 1 fork. Branch: main (1 branch, 0 tags). Latest commit: 2a51214, "happy new year 2024", by zied hosni, Jan 1, 2024 (2 commits).
| Name | Latest commit message | Commit time |
| --- | --- | --- |
| nginx_config_sample | happy new year 2024 | December 31, 2023 19:45 |
| .gitignore | Initial commit | December 31, 2023 17:23 |
| LICENSE | Initial commit | December 31, 2023 17:23 |
| README.md | happy new year 2024 | December 31, 2023 19:45 |
| api.js | happy new year 2024 | December 31, 2023 19:45 |
| index.js | happy new year 2024 | December 31, 2023 19:45 |
| package.json | happy new year 2024 | December 31, 2023 19:45 |

# Page Replica

"Page Replica" is a versatile web scraping and caching tool built with Node.js, Express, and Puppeteer. It prerenders web app pages (React, Angular, Vue, ...) so that they can be served via Nginx for SEO or other purposes.

The tool lets you scrape individual web pages or entire sitemaps through an API, optionally removing JavaScript, and caches the resulting HTML. It also includes a sample Nginx configuration that handles regular user traffic and search engine bot traffic separately.

## Installation

1. Clone the repository:

        git clone https://github.com/html5-ninja/page-replica.git
        cd page-replica

2. Install dependencies:

        npm install

3. Settings:

   * `index.js`:

         const CONFIG = {
           baseUrl: "https://example.com",
           removeJS: true,
           addBaseURL: true,
           cacheFolder: "path_to_cache_folder",
         };

   * `api.js`: set the port for your API.

4. Start the API:

        npm start

## Usage

Scraping a page or a sitemap stores a copy of the prerendered page in the cache folder.
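To make the `removeJS` and `addBaseURL` settings from step 3 more concrete, here is a minimal sketch of the kind of post-processing they could trigger before a page is cached. It is an illustration only: the helper names `stripScripts`, `addBaseTag`, and `postProcess` are hypothetical, the regex-based approach is a simplification, and the assumption that `addBaseURL` inserts a `<base>` tag pointing at `baseUrl` is a guess. The actual logic in `index.js` may differ.

```javascript
// Hypothetical helpers; the real index.js may implement these options differently.
const CONFIG = {
  baseUrl: "https://example.com",
  removeJS: true,
  addBaseURL: true,
  cacheFolder: "path_to_cache_folder",
};

// Remove <script>...</script> elements from an HTML string.
// A regex is enough for a sketch; a DOM-based approach is more robust.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

// Assumption: addBaseURL means "insert a <base> tag right after <head>"
// so that relative asset URLs resolve against the original site.
function addBaseTag(html, baseUrl) {
  return html.replace(
    /<head(\s[^>]*)?>/i,
    (match) => `${match}<base href="${baseUrl}/">`
  );
}

// Apply the enabled options in order and return the HTML to cache.
function postProcess(html, config) {
  let out = html;
  if (config.removeJS) out = stripScripts(out);
  if (config.addBaseURL) out = addBaseTag(out, config.baseUrl);
  return out;
}
```

With both options enabled, the cached copy keeps its markup and relative links but no longer executes client-side code, which is usually what a crawler-facing snapshot needs.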
### Scraping Individual Pages

To scrape a single page, make a GET request to `/page` with the `url` query parameter:

    curl "http://localhost:8080/page?url=https://example.com"

### Scraping Sitemaps

To scrape all pages listed in a sitemap, make a GET request to `/sitemap` with the `url` query parameter:

    curl "http://localhost:8080/sitemap?url=https://example.com/sitemap.xml"

## Serve the Cached Pages to Bots with Nginx (My Recipe)

In this recipe, the cached pages are served using Nginx. You can adapt the configuration to your needs and your server.

The sample configuration in `nginx_config_sample/example.com.conf` routes regular users to the main application server and directs search engine bots to a dedicated server block that delivers the cached HTML. Review that file to understand how it works.

## Contribution

We welcome contributions! If you have ideas for new features or server/cloud configurations that could enhance this tool, feel free to:

* Open an issue to discuss your ideas.
* Fork the repository and make your changes.
* Submit a pull request with a clear description of your changes.

## Feature Requests and Suggestions

If you have feature requests or suggestions for server/cloud configurations beyond Nginx, please open an issue to start a discussion.

## Folder Structure

* `nginx_config_sample`: a sample Nginx configuration for redirecting bot traffic to the cached content server.
* `api.js`: an Express application that handles web scraping requests.
* `index.js`: the core web scraping logic, built on Puppeteer.
* `package.json`: Node.js project configuration.
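The `/page` and `/sitemap` requests described under Usage can also be issued from a Node.js script instead of `curl`. The sketch below is not part of the repository: `buildScrapeUrl` and `requestScrape` are hypothetical helper names, and it assumes the API is reachable at `http://localhost:8080` as in the `curl` examples and that Node.js 18 or later is available for the global `fetch`. It also percent-encodes the target URL, on the assumption that the Express API decodes the `url` query parameter in the usual way.

```javascript
// Hypothetical client helpers; not part of page-replica itself.
// Assumes the API from api.js is reachable at apiBase (e.g. http://localhost:8080).

// Build the request URL for the /page or /sitemap endpoint.
function buildScrapeUrl(apiBase, endpoint, targetUrl) {
  if (endpoint !== "page" && endpoint !== "sitemap") {
    throw new Error(`Unknown endpoint: ${endpoint}`);
  }
  const requestUrl = new URL(`/${endpoint}`, apiBase);
  // searchParams.set percent-encodes the target, so a "?" or "&" inside
  // the target URL is not mistaken for part of the API request itself.
  requestUrl.searchParams.set("url", targetUrl);
  return requestUrl.toString();
}

// Send the GET request and return the HTTP status code.
// Requires Node.js 18+ for the global fetch.
async function requestScrape(apiBase, endpoint, targetUrl) {
  const response = await fetch(buildScrapeUrl(apiBase, endpoint, targetUrl));
  return response.status;
}

// Example (needs a running API):
// requestScrape("http://localhost:8080", "sitemap", "https://example.com/sitemap.xml")
//   .then((status) => console.log("status", status));
```

Encoding matters mostly for target URLs that carry their own query string; for a plain URL such as `https://example.com`, the unencoded `curl` form shown under Usage is sufficient.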
Topics: frontend, ssr, seo-optimization, prerendering. Languages: JavaScript 100.0%.