https://github.com/html5-ninja/page-replica

html5-ninja / page-replica Public

Page Replica - Tool for Web Scraping, Prerendering, and SEO Boost

License: MIT license
128 stars, 1 fork

Latest commit

zied hosni: happy new year 2024 (2a51214, Jan 1, 2024)

2 commits

Files

Name                 Latest commit message  Commit time
nginx_config_sample  happy new year 2024    December 31, 2023 19:45
.gitignore           Initial commit         December 31, 2023 17:23
LICENSE              Initial commit         December 31, 2023 17:23
README.md            happy new year 2024    December 31, 2023 19:45
api.js               happy new year 2024    December 31, 2023 19:45
index.js             happy new year 2024    December 31, 2023 19:45
package.json         happy new year 2024    December 31, 2023 19:45

README.md

 Page Replica

"Page Replica" is a web scraping and caching tool built with Node.js,
Express, and Puppeteer. It prerenders the pages of web apps (React,
Angular, Vue, ...) so that they can be served via Nginx for SEO or
other purposes.

The tool lets you scrape individual web pages or entire sitemaps
through an API, optionally removing JavaScript, and caches the
resulting HTML.

It also includes a sample Nginx configuration that handles regular
user traffic and search engine bot traffic separately.

 Installation

 1. Clone the Repository:

    git clone https://github.com/html5-ninja/page-replica.git
    cd page-replica

 2. Install Dependencies:

    npm install

 3. Settings:

  * index.js

    const CONFIG = {
      baseUrl: "https://example.com",
      removeJS: true, // remove JavaScript from the cached HTML
      addBaseURL: true,
      cacheFolder: "path_to_cache_folder", // where prerendered pages are stored
    };

  * api.js: set the port for your API

 4. Start the API:

    npm start
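
To give a feel for what the removeJS and addBaseURL settings might do to a prerendered page, here is a minimal sketch. It is an assumption based on the option names only, not the code in index.js; stripScripts and addBaseTag are hypothetical helper names, and the regex-based approach is a simplification.

```javascript
// Hypothetical helper: drop every <script>...</script> element from an HTML string.
// Assumption: this approximates what `removeJS: true` is meant to achieve.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

// Hypothetical helper: insert a <base> tag right after the opening <head> tag
// so that relative links in the cached copy resolve against the original site.
// Assumption: this approximates what `addBaseURL: true` is meant to achieve.
function addBaseTag(html, baseUrl) {
  return html.replace(
    /<head(\s[^>]*)?>/i,
    (openingTag) => `${openingTag}<base href="${baseUrl}">`
  );
}

// Example: clean a small prerendered document.
const rendered =
  '<html><head><title>t</title></head><body><p>a</p><script>alert(1)</script></body></html>';
const cached = addBaseTag(stripScripts(rendered), "https://example.com");
console.log(cached);
```

Regular expressions are enough for a sketch like this, but a real implementation would more likely edit the DOM inside the browser page before serializing it.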

 Usage

When you scrape a page or a sitemap, a copy of each prerendered page
is stored in the cache folder.

 Scraping Individual Pages

To scrape a single page, make a GET request to /page with the url
query parameter:

curl "http://localhost:8080/page?url=https://example.com"
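
The same request can be made from a Node.js script. The sketch below assumes Node.js 18 or later (for the built-in fetch) and that the API listens on localhost:8080 as in the curl example. buildPageRequestUrl and scrapePage are hypothetical names, and the response body is not inspected because the README does not describe it.

```javascript
// Build the /page request URL, percent-encoding the target so that any
// query string in the target URL survives as part of the `url` parameter.
function buildPageRequestUrl(apiBase, targetUrl) {
  const requestUrl = new URL("/page", apiBase);
  requestUrl.searchParams.set("url", targetUrl);
  return requestUrl.toString();
}

// Ask the running API to scrape and cache one page.
// Resolves to true when the API answers with a 2xx status.
async function scrapePage(apiBase, targetUrl) {
  const response = await fetch(buildPageRequestUrl(apiBase, targetUrl));
  return response.ok;
}

console.log(buildPageRequestUrl("http://localhost:8080", "https://example.com"));
// prints "http://localhost:8080/page?url=https%3A%2F%2Fexample.com"
```

Encoding the target matters once it carries its own query string, for example https://example.com/search?q=a&page=2; without encoding, the API would see page=2 as a separate parameter.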

 Scraping Sitemaps

To scrape pages from a sitemap, make a GET request to /sitemap with
the url query parameter:

curl "http://localhost:8080/sitemap?url=https://example.com/sitemap.xml"
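
Conceptually, scraping a sitemap means collecting the page URLs it lists and prerendering each one. The sketch below shows only the first step. It is an illustration, not the project's implementation: extractSitemapUrls is a hypothetical helper, and it uses a simple regular expression rather than a real XML parser.

```javascript
// Hypothetical helper: return every URL found in a <loc> element of a sitemap.
// Limitation: XML entities such as &amp; are returned as-is, not decoded.
function extractSitemapUrls(xml) {
  const matches = xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g);
  return Array.from(matches, (match) => match[1]);
}

const sampleSitemap = `
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleSitemap));
// prints [ 'https://example.com/', 'https://example.com/about' ]
```

Each extracted URL could then be passed to the /page endpoint one at a time, which is a way to re-scrape only part of a large sitemap.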

 Serve the Cached Pages to Bots with Nginx (My Recipe)

In this setup, the cached pages are served by Nginx. You can adapt the
configuration to your needs and your server.

The sample Nginx configuration, located at
nginx_config_sample/example.com.conf, routes regular users to the main
application server and sends search engine bots to a dedicated server
block that delivers the cached HTML.

Please review nginx_config_sample/example.com.conf to understand how it
works.
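
The routing decision can be pictured in JavaScript as a check on the User-Agent header. This is only a sketch of the idea: it assumes bots are recognized by their user agent string, which the sample configuration may or may not do in the same way. isSearchBot and the list of bot names are illustrative choices, not taken from example.com.conf.

```javascript
// Illustrative list of crawler tokens that appear in well-known bot user agents.
const BOT_PATTERN = /googlebot|bingbot|yandex|baiduspider|duckduckbot/i;

// Hypothetical helper: true when the user agent looks like a search engine bot,
// in which case the request would be answered from the cached HTML.
function isSearchBot(userAgent) {
  return typeof userAgent === "string" && BOT_PATTERN.test(userAgent);
}

const googlebotAgent =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
const browserAgent =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

console.log(isSearchBot(googlebotAgent)); // prints true
console.log(isSearchBot(browserAgent)); // prints false
```

Making this decision in Nginx rather than in application code keeps regular user traffic on its normal path to the main application server.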

 Contribution

We welcome contributions! If you have ideas for new features or
server/cloud configurations that could enhance this tool, feel free
to:

  * Open an issue to discuss your ideas.
  * Fork the repository and make your changes.
  * Submit a pull request with a clear description of your changes.

 Feature Requests and Suggestions

If you have any feature requests or suggestions for server/cloud
configurations beyond Nginx, please open an issue to start a
discussion.

 Folder Structure

  * nginx_config_sample: A sample Nginx configuration for routing bot
    traffic to the cached content server.
  * api.js: The Express application that handles web scraping
    requests.
  * index.js: The core web scraping logic, built on Puppeteer.
  * package.json: Node.js project configuration.

About

Topics: frontend, ssr, seo-optimization, prerendering

Watchers: 1 watching

Releases: No releases published

Packages: No packages published

Languages

  * JavaScript 100.0%
