https://github.com/html5-ninja/page-replica
html5-ninja / page-replica Public
Page Replica - Tool for Web Scraping, Prerendering, and SEO Boost
License
MIT license
Latest commit: zied hosni, "happy new year 2024" (2a51214), Jan 1, 2024
Page Replica
"Page Replica" is a web scraping and caching tool built with Node.js,
Express, and Puppeteer. It prerenders the pages of web apps (React,
Angular, Vue, ...) so that they can be served via Nginx for SEO or
other purposes.
The tool lets you scrape individual web pages or entire sitemaps
through an API, optionally removing JavaScript, and caches the
resulting HTML.
It also includes a sample Nginx configuration that handles regular
users and search engine bots differently.
Installation
1. Clone the Repository:
git clone https://github.com/html5-ninja/page-replica.git
cd page-replica
2. Install Dependencies:
npm install
3. Settings:
* index.js
const CONFIG = {
  baseUrl: "https://example.com",
  removeJS: true,
  addBaseURL: true,
  cacheFolder: "path_to_cache_folder",
};
* api.js: set the port for your API
4. Start the API:
npm start
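To illustrate what the removeJS and addBaseURL settings from step 3 plausibly do, here is a minimal sketch that post-processes an HTML string. The helper names (stripScripts, addBaseTag, postProcess) are hypothetical and the regex approach is an assumption; the actual logic in index.js may work differently, for example by editing the DOM inside Puppeteer.

```javascript
// Hypothetical helpers: names and behaviour are assumptions, not code from index.js.

// Remove <script>...</script> elements from an HTML string.
// A regex is enough for a sketch; a real implementation may edit the DOM instead.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

// Insert <base href="..."> right after the opening <head> tag so that
// relative URLs in the cached copy resolve against the original site.
function addBaseTag(html, baseUrl) {
  return html.replace(/<head(\s[^>]*)?>/i, (match) => `${match}<base href="${baseUrl}">`);
}

// Apply the two optional steps according to a CONFIG-like object.
function postProcess(html, config) {
  let out = html;
  if (config.removeJS) out = stripScripts(out);
  if (config.addBaseURL) out = addBaseTag(out, config.baseUrl);
  return out;
}

const sampleHtml =
  '<html><head><title>t</title><script src="/app.js"></script></head>' +
  '<body><a href="/about">About</a></body></html>';

console.log(
  postProcess(sampleHtml, {
    baseUrl: "https://example.com",
    removeJS: true,
    addBaseURL: true,
  })
);
```

Removing scripts keeps the cached copy static, while the base tag lets links such as /about keep pointing at the live site.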
Usage
When you scrape a page or a sitemap, a copy of each prerendered page
is stored in the cache folder.
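One way to picture the cache is as a folder tree that mirrors the URL paths. The sketch below maps a page URL to a file path under cacheFolder. The function name cachePathFor and the layout (hostname directory, .html suffix, index.html for the root) are assumptions made for illustration; check index.js for the layout the tool really uses.

```javascript
// Hypothetical mapping from a page URL to a file inside cacheFolder.
// The layout actually produced by index.js may differ.
function cachePathFor(pageUrl, cacheFolder) {
  const { hostname, pathname } = new URL(pageUrl);
  // Drop trailing slashes; the site root maps to index.html.
  const trimmed = pathname.replace(/\/+$/, "");
  const relative = trimmed === "" ? "/index.html" : `${trimmed}.html`;
  return `${cacheFolder.replace(/\/+$/, "")}/${hostname}${relative}`;
}

console.log(cachePathFor("https://example.com/", "path_to_cache_folder"));
console.log(cachePathFor("https://example.com/blog/post-1", "path_to_cache_folder"));
```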
Scraping Individual Pages
To scrape a single page, make a GET request to /page with the url
query parameter:
curl "http://localhost:8080/page?url=https://example.com"
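If the target URL has its own query string, it should be percent-encoded before it is passed as the url parameter. Otherwise a character such as & would be read as the start of a second parameter on the API request. The sketch below builds the request URL with the standard URL API. The helper name buildScrapeUrl is hypothetical, and port 8080 is taken from the curl example above.

```javascript
// Build a request URL for the scraping API with the target URL safely encoded.
// buildScrapeUrl is a hypothetical helper; port 8080 follows the curl example.
function buildScrapeUrl(apiBase, endpoint, targetUrl) {
  const requestUrl = new URL(endpoint, apiBase);
  // searchParams.set percent-encodes reserved characters such as ":", "/", "?" and "&".
  requestUrl.searchParams.set("url", targetUrl);
  return requestUrl.toString();
}

console.log(
  buildScrapeUrl("http://localhost:8080", "/page", "https://example.com/search?q=a&page=2")
);
```

The resulting string can be passed to curl or to fetch in Node.js 18 or later.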
Scraping Sitemaps
To scrape pages from a sitemap, make a GET request to /sitemap with
the url query parameter:
curl "http://localhost:8080/sitemap?url=https://example.com/sitemap.xml"
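A sitemap lists page URLs in loc elements, so scraping a sitemap amounts to extracting those URLs and scraping each one. The sketch below shows the extraction step on an inline XML string. The function name extractSitemapUrls is hypothetical, and the regex is a simplification; the tool itself may use a proper XML parser.

```javascript
// Extract the page URLs listed in <loc> elements of a sitemap.
// Hypothetical helper; a regex is a simplification of real XML parsing.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  for (const match of xml.matchAll(locPattern)) {
    // Sitemaps escape "&" as "&amp;"; undo that for the request URL.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}

const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/list?a=1&amp;b=2</loc></url>
</urlset>`;

console.log(extractSitemapUrls(sampleSitemap));
```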
Serve the Cached Pages to Bots with Nginx (My Recipe)
In this recipe, the cached pages are served by Nginx. You can adapt
the configuration to your needs and your server.
The sample configuration in nginx_config_sample/example.com.conf
routes regular users to the main application server and sends search
engine bots to a dedicated server block that delivers the cached HTML.
Review nginx_config_sample/example.com.conf to understand how it
works.
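The routing decision can be pictured in JavaScript as a check on the User-Agent header. This is only a sketch of the idea: the crawler list, the function names isSearchBot and chooseUpstream, and the upstream labels are all assumptions, and the sample Nginx file may detect bots in a different way.

```javascript
// Assumed list of crawler tokens; not taken from the sample Nginx config.
const BOT_PATTERN = /googlebot|bingbot|yandex|duckduckbot|baiduspider/i;

// Return true when the User-Agent string looks like a search engine crawler.
function isSearchBot(userAgent) {
  return typeof userAgent === "string" && BOT_PATTERN.test(userAgent);
}

// Mirror the two-way split: bots get cached HTML, everyone else gets the app.
function chooseUpstream(userAgent) {
  return isSearchBot(userAgent) ? "cached-html" : "app-server";
}

console.log(chooseUpstream("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"));
console.log(chooseUpstream("Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"));
```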
Contribution
We welcome contributions! If you have ideas for new features or
server/cloud configurations that could enhance this tool, feel free
to:
* Open an issue to discuss your ideas.
* Fork the repository and make your changes.
* Submit a pull request with a clear description of your changes.
Feature Requests and Suggestions
If you have any feature requests or suggestions for server/cloud
configurations beyond Nginx, please open an issue to start a
discussion.
Folder Structure
* nginx_config_sample: A sample Nginx configuration for routing bot
traffic to the cached content server.
* api.js: An Express application responsible for handling web
scraping requests.
* index.js: The core web scraping logic employing Puppeteer.
* package.json: Node.js project configuration.