[HN Gopher] MangaDex infrastructure overview
___________________________________________________________________
MangaDex infrastructure overview
Author : m45t3r
Score : 521 points
Date : 2021-09-07 03:50 UTC (19 hours ago)
(HTM) web link (mangadex.dev)
(TXT) w3m dump (mangadex.dev)
| probotect0r wrote:
| How is the Wireguard VPN set up? Has anyone used Wireguard to set
| up a VPN into AWS VPCs?
| germanjoey wrote:
| Many manga fans have a love/hate relationship with mangadex. On
| one hand, it's provided hosting for countless hours of
| entertainment over the years. Their "v3" version of the site was
| basically perfect from a usability point of view, to the point
| that the entire community chose to unite itself under its flag.
|
| On the other hand, directly because of the above, their hasty
| self-inflicted take down earlier this year nearly killed the
| entire hobby. Many series essentially stopped updating for the ~5
| months the site was down, and many more are likely never coming
| back again.
|
| The decision to suddenly take the site down for a full site
| rewrite feels completely inexplicable from the outside. (Writeups
| like the above or the previous one [1], both of which read like
| they were written by a Google Product Manager, especially don't
| help, as they conspicuously avoid any comment on the one question
| on everyone's mind: "leaving aside the supposed security issues
| with the backend, why on earth also rewrite and redesign the
| entire front end from scratch at the same time?")
|
| [1] https://mangadex.dev/why-rebuild/
| Jcowell wrote:
| I never got the sense that the manga community hated Mangadex
| and I've been following their Dev of V5 and the rise of other
| sites to use in their absence.
|
| It seems weird to attribute Mangadex taking their site down for
| valid security concerns to the end of scanlation of certain
| series. That seems entirely like a scan team problem if they
| decide not to upload via Cubari like other teams have done. And
| it doesn't even matter, since a series can get sniped at any
| time.
|
| It makes complete sense that if you're going to rewrite the
| backend and API from scratch, you might as well do the front end
| too, since it was a goal from the beginning.
| nvarsj wrote:
| I didn't get that take at all from the why-rebuild link. It
| seems reasonable to me - legacy code base, hard to maintain,
| with security problems that led to the massive data leak a
| while back. They also don't owe anything to anyone, and as a
| hobbyist project, they wanted to try something new. I'm
| impressed as they seem to have managed it - and the new site
| feels a lot more responsive than the old one.
| creshal wrote:
| Yeah. This whole mess pushed me to move everything I had (or
| could remember, anyway) to Tachiyomi [1], so I can hop between
| hosting websites freely without losing progress or access to old
| chapters (as long as I don't run out of local storage).
|
| And while it works fine for reading, it kills any interaction
| with the hosting sites. No chance for monetization,
| socialization or anything else that can help sites survive
| long-term.
|
| [1] https://tachiyomi.org/
| m45t3r wrote:
| Before MangaDex we had Batoto (the old Batoto, before some
| sketchy company bought the name), which was kind of the same:
| serving high-quality manga for most scanlators that wanted it
| (and also avoiding hosting pirated chapters from official
| sources, so much like MangaDex nowadays). As far as I remember,
| Batoto closed because of pressure from companies and also because
| of the high costs of running the site.
|
| So yeah, considering how fragile maintaining a site like this is,
| it is always a good idea to sync your progress with a third-party
| service so it is easier to migrate if something goes wrong.
|
| > And while it works fine for reading, it kills any
| interaction with the hosting sites. No chance for
| monetization, socialization or anything else that can help
| sites survive long-term.
|
| BTW, MangaDex doesn't have monetization because it is strictly a
| hobby and also because monetizing this kind of work is a gray
| area [1]. Also, their Tachiyomi client is official (the MangaDex
| v5 API was tested primarily via their Tachiyomi client before
| they finished the web interface).
|
| [1]: both for the companies (which hold the copyright to the
| works hosted on those sites) and the scanlators (the fans who do
| the actual work of translating those chapters). Sites that host
| those chapters and monetize are pretty much monetizing other
| people's work.
| majora2007 wrote:
| Totally a self-plug, but if you're looking to take it a step
| further, Kavita is a great program for hosting your own Plex-like
| manga server.
|
| https://kavitareader.com
| creshal wrote:
| That actually looks really interesting, thanks!
| vymague wrote:
| > their hasty self-inflicted take down earlier this year nearly
| killed the entire hobby
|
| It won't kill the hobby, because these scanlators are making mad
| money from ads, Patreon, and crypto mining. I'll never get why
| they don't get more aggressive takedown notices from
| Chinese/Japanese/Korean publishers.
| level3 wrote:
| They get plenty of takedown notices, but they mostly get to
| hide behind services like Cloudflare who won't take action
| regarding these notices anyway. From the publishers/creators
| side, there is simply no effective way to take scanlators
| down.
| kmeisthax wrote:
| Copyright enforcement is actually quite expensive, both for
| the litigant and the defendant. The only way for it to be
| actually profitable to sue someone who is stealing your work
| is if they immediately settle, which is how copyright trolls
| operate. Everything else is a massive money pit for everyone
| involved, even the lawyers. Since this is an international
| enforcement action, the costs go up _more_, because now you need
| _multiple_ legal teams at the bar in each jurisdiction,
| translators qualified to interpret laws in foreign languages,
| knowledge of local copyright quirks, and a lot
| more coordination than just asking your local counsel to send
| a takedown notice locally.
|
| (Just as an example of a local copyright quirk that will
| probably confuse a lot of people in the audience from Europe:
| copyright registration. America really, _really_ wants you to
| register your copyright, even though they signed onto Berne
| /WTO/TRIPS which was supposed to abolish that regime
| entirely. As a result, America did the bare minimum of
| compliance. You don't lose your copyright if you don't
| register, but you can't sue until you do, and if you register
| after your work was infringed, you don't get statutory
| damages... which means your costs go way up.)
|
| Furthermore, every enforcement action you take risks PR
| backlash. The whole fandom surrounding imported Japanese comics
| basically grew out of a piracy scene. Originally, there
| were no English translations, and the scene was basically
| reusing what we'd now call "orphan works". There used to be
| an unspoken rule among most fansubbers of not translating
| material that was licensed in the US. All that's changed;
| most everything gets licensed and many fan translators
| absolutely are stepping on the toes of licensees. However,
| every time a licensee or licensor actually takes an
| enforcement action, they get huge amounts of blowback from
| their own fans.
| chaorace wrote:
| I suspect it's because the international market for print
| manga (the primary cash cow) is rather anemic, particularly
| compared to anime.
|
| Publishers see the loss as minimal and creators see piracy as
| free advertising to drum up enthusiasm for anime adaptations,
| which actually _do_ drum up decent profits internationally (
| _the committee keeps the streaming licensing fees, not the
| animation studio_ ).
| level3 wrote:
| Publishers definitely don't see it that way; that's mostly
| an extension of a myth in order to justify the piracy.
|
| Most manga publishers will see relatively little revenue
| from international anime releases. Even for domestic anime
| releases of the vast majority of titles, the manga
| publisher is only a small part of the anime production
| committee, and the hope is mostly that popularity of the
| anime can lead to increased sales of the manga,
| merchandise, or other events. So when the anime is released
| internationally, they get an even smaller cut of that
| because the international licensee also has to take their
| profit.
|
| But other than mega-hit titles where an international anime
| release may also lead to significant international manga
| sales, the popularity of an anime adaptation overseas is
| practically irrelevant to the original manga publisher.
| FpUser wrote:
| I wrote a business backend server that calculates various things
| and returns results as JSON. It serves on average up to 5000
| requests/s for about $220 CDN / month. Architecture: a single
| executable written in C++ running on a rented dedicated server
| from OVH with 16 cores, 128GB RAM and a couple of SSDs.
|
| It can do a much higher rate on simple requests, but the most
| common requests are heavy iterative calculations, hence the
| average of 5000 requests/s.
| cyberpsybin wrote:
| At some point in the new redesign, they started to load full-size
| images for thumbnails. The whole site feels slower due to that.
| It needs an automatic re-scaler service.
| tristan9 wrote:
| Not correct, we generate 2 thumbnail sizes for every cover --
| if the site loads full-size anywhere by default (rather than
| when you expand it) it's definitely a bug!
| maxk42 wrote:
| I run an Alexa top-2000 website. (Mangadex is presently at about
| 6000.) I spend less than $250 a month.
|
| I have loads and loads of thoughts about what they could be doing
| differently to reduce their costs but I'll just say that the
| number one thing Mangadex could be doing right now from a cursory
| glance is to reduce the number of requests. A fresh load of the
| home page generates over 100 requests. (Mostly images, then
| javascript fragments.) Mangadex apparently gets over 100 million
| requests per day. My site - despite ostensibly having more
| traffic - gets fewer than half that many in a month. (And yes,
| it's image-heavy.)
|
| A couple of easy wins would be to reduce the number of images loaded
| on the front page. (Does the "Seasonal" slider really need 30
| images, or would 5 and a link to the "seasonal" page be enough?
| Same thing with "Recently Added" and the numbers of images on
| pages in general.) The biggest win would probably be reducing the
| number of javascript requests. Somehow people seem to think
| there's some merit to loading javascript which subsequently loads
| additional javascript. This adds a tremendous amount of latency
| to your page load and generates needless network traffic. Each
| request has a tremendous amount of overhead - particularly for
| dynamically-generated javascript. It's much better to load all of
| the javascript you need in a single request or a small handful of
| requests. Unfortunately, this is probably a huge lift for a site
| already designed in this way, but the improved loading time would
| be a big UX win.
|
| Anyway - best of luck to MangaDex! They've clearly put a lot of
| thought into this.
| tristan9 wrote:
| Hi, we're trying to lower the requests:pageview ratio in
| general, but for what it's worth this article essentially:
|
| - ignores the vast majority of "image serving" (most is handled
| by DDG and our custom CDN)
|
| - the JS fragments thankfully should load only on first visit
| and then get aggressively cached by DDG/your browser
|
| One of the pain points is that there are a lot of settings for
| users to decide what they should or shouldn't see (content
| rating, original language, search tags, etc.), and some are
| already specifically denormalized (when querying chapter
| entities, the ES indices for those contain some manga-level
| properties to avoid needing to dereference the manga first) --
| however, this also makes caching substantially less efficient in
| many places, alas.
|
| Thanks!
| [deleted]
| the8472 wrote:
| One issue I see is that flipping back and forth between chapters
| reloads images from different URLs, which means they're
| uncacheable. I guess that's somehow related to the
| mangadex@home thing, but if the URLs were generated in a more
| deterministic manner (keyed on some client ID + the chapter
| being loaded) then the browser could avoid redundant traffic.
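| (For illustration, a minimal Python sketch of the idea above:
| derive the token from a client ID and the chapter, so the same
| client re-opening the same chapter gets a byte-identical,
| cacheable URL. The secret, hostname and path layout are made up
| for the example and are not how MD@H actually works.)
|
|   import hashlib
|   import hmac
|
|   SECRET = b"node-shared-secret"  # assumed secret, illustrative
|
|   def page_url(client_id: str, chapter_id: str, page: int) -> str:
|       # A stable HMAC of client + chapter means repeat visits
|       # produce the same URL, so the browser cache can reuse it.
|       msg = f"{client_id}:{chapter_id}".encode()
|       token = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:32]
|       return f"https://node.example/{token}/{chapter_id}/{page}.png"
|
|   # Both calls return the same URL, hence a cache hit on reload.
|   print(page_url("client-123", "chapter-abc", 1))
|   print(page_url("client-123", "chapter-abc", 1))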
| tristan9 wrote:
| That's very close to how MD@H works, but it also has a time
| component and tokens are not generated by our main
| backends, so it'd require a separate internal http call per
| chapter
| the8472 wrote:
| Another thing. For each page that's being loaded there's
| a report being sent. Instead this could be aggregated
| (e.g. once a second) and then processed as a batch on the
| server side which should be faster.
|
| And if your JS assets are hashed then you can add cache-
| control: immutable so that a browser doesn't have to
| reload them when the user F5s.
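| (For illustration, a minimal Python sketch of the batching idea
| above: buffer per-page reports and flush them once a second as a
| single request. The endpoint and payload fields are assumptions
| for the example, not MangaDex's actual reporting API.)
|
|   import threading
|   import time
|
|   import requests
|
|   REPORT_URL = "https://api.example/reports/batch"  # assumed
|   _buffer: list = []
|   _lock = threading.Lock()
|
|   def report_page_load(chapter_id: str, page: int, ms: int) -> None:
|       # Instead of one HTTP call per page view, just queue it.
|       event = {"chapter": chapter_id, "page": page, "ms": ms}
|       with _lock:
|           _buffer.append(event)
|
|   def _flush_loop() -> None:
|       while True:
|           time.sleep(1)  # aggregate roughly one second of events
|           with _lock:
|               batch, _buffer[:] = list(_buffer), []
|           if batch:
|               requests.post(REPORT_URL, json=batch, timeout=5)
|
|   threading.Thread(target=_flush_loop, daemon=True).start()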
| jiggawatts wrote:
| Hi, I'm a performance tuning expert, and this thread piqued
| my interest.
|
| The _first thing_ I noticed is that even with caching enabled,
| you're loading "too much data". After loading the
| main page and then clicking one of the tiles, there are
| several JSON API calls.
|
| Here's an example, 195 kB transferred (528 kB size): https://
| api.mangadex.org/manga/bbaa17c4-0f36-4bbb-9861-34fc8...
|
| Oof. Half a _megabyte_ of JSON! Ignore the network traffic
| for a moment, because GZIP does wonders. The real problem is
| that generating that much JSON is very "heavy" on servers.
| Lots and lots of small object allocations, which gives the
| garbage collector a ton of work to do. It's also expensive to
| decode on the browser for similar reasons.
|
| On my computer, this took a whopping 455ms to transfer,
| nearly half a second. That results in a noticeable latency
| hit to the site.
|
| In my consulting gig I always give developers the same
| advice: "Displaying 1 kilobyte of data should take roughly 1
| kilobyte of traffic".
|
| In other words, there _isn't 500 KB of text anywhere on that
| page!_ A quick cut & paste shows about 8 KB of user-visible text
| in the final HTML rendering. That's a 1:60 ratio
| of content-to-data, which is very poor. I bet that behind the
| scenes, this took a heck of a lot more back-end network
| traffic and in-memory processing to generate. Probably tens
| to hundreds of megabytes of internal traffic, all up.
|
| This is one of the core reasons most sites have difficulty
| scaling, because for every kilobyte of content output to the
| screen, they're powering through megabytes or even gigabytes
| of data behind the scenes.
|
| Can this API query be cut down to match what's displayed on
| the screen? Can it be cached for all users? Can it be cached
| _precompressed_?
|
| Etc...
| tristan9 wrote:
| > The real problem is that generating that much JSON is
| very "heavy" on servers. Lots and lots of small object
| allocations, which gives the garbage collector a ton of
| work to do. It's also expensive to decode on the browser
| for similar reasons.
|
| For what it's worth, this isn't generated live but a mix of
| existing entity documents
|
| Most of it is page filenames which indeed could be made
| optional and fetched only by the reader, but that'd be us
| actively nulling them out in the returned entity, since
| they are there in the ES documents for the chapters (a
| manga feed like this being a list of chapters)
| jiggawatts wrote:
| You're basically dumping a database down to the web browser,
| including all of the internal metadata that's likely irrelevant
| to rendering the HTML.
|
| For example, user role memberships:
|
|   {
|     "id": "c80b68c5-09ae-4a50-a447-df7c5a4a6d01",
|     "type": "user",
|     "attributes": {
|       "username": "kinshiki",
|       "roles": [
|         "ROLE_MEMBER",
|         "ROLE_GROUP_MEMBER",
|         "ROLE_POWER_UPLOADER"
|       ],
|       "version": 1
|     }
|   }
|
| Also record timestamps like created/updated, along with contact
| details that may reveal sensitive info:
|
|   "attributes": {
|     "name": "SENPAI TEAM",
|     "locked": true,
|     "website": "https:\/\/discord.gg\/84e3j9b",
|     "ircServer": null,
|     "ircChannel": null,
|     "discord": "84e3j9b",
|     "contactEmail": "senpai.info@gmail.com",
|     "description": null,
|     "official": false,
|     "verified": false,
|     "createdAt": "2021-04-19T21:45:59+00:00",
|     "updatedAt": "2021-04-19T21:45:59+00:00",
|     "version": 1
|   }
| }
|
| But let's just go back to your response:
|
| > Most of it is page filenames which indeed could be made
| optional
|
| Do that! If you strip them out, the 529 kB document shrinks
| to 280 kB, which hardly seems worth the hassle, but when
| gzipped, this is a minuscule 13 kB! This is because those
| strings are _hashes_, which _significantly_ reduces their
| compressibility compared to general JSON, which usually
| compresses very well.
|
| It's basic stuff like this that can make a website
| absolutely fly.
|
| Avoid giving computers unnecessary, mandatory work:
| https://blog.jooq.org/many-sql-performance-problems-stem-
| fro...
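| (For illustration, a quick Python check of the compressibility
| point above: random hex hashes barely compress, while repetitive
| JSON shrinks dramatically. A rough sketch; exact numbers will
| vary.)
|
|   import gzip
|   import json
|   import secrets
|
|   # ~270 KB of page-filename-like hashes vs ~250 KB of
|   # repetitive JSON, both gzipped for comparison.
|   hashes = json.dumps(
|       [secrets.token_hex(32) for _ in range(4000)]).encode()
|   boring = json.dumps(
|       [{"official": False, "version": 1}] * 7000).encode()
|
|   for name, blob in [("hashes", hashes), ("repetitive", boring)]:
|       packed = gzip.compress(blob)
|       print(f"{name}: {len(blob)} B -> {len(packed)} B gzipped")
|   # The hash list stays close to its original size; the
|   # repetitive document compresses to a tiny fraction of it.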
| tristan9 wrote:
| As I said, it's not so much that we ask that data to be
| fetched -- it is there in the first place, and pulled
| from Elasticsearch, not a SQL database
|
| Because of this model, we also make sure that
| Elasticsearch merely works as a search cache, not as an
| authoritative content database (hence everything we add
| in there is considered public, on purpose, and what isn't
| meant to be public is just not indexed in ES)
|
| However the gzip efficiency improvements would be really
| neat for sure
|
| Fwiw I also don't work on the backend and there might be
| good reasons to not expressly filter out data (yet
| anyway, perhaps it will end up as a separate entity and
| be an include parameter)
| clambordan wrote:
| You can query Elastic for specific fields only:
| https://www.elastic.co/guide/en/elasticsearch/reference/curr...
|
| Edit: As you said, there may be reasons on the backend
| not to filter things out of the query. Though it seems
| likely that the web response could be trimmed down.
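| (For illustration, a minimal Python sketch of Elasticsearch
| source filtering: only the listed fields come back in _source.
| The index name and field names here are invented for the
| example, not MangaDex's actual schema.)
|
|   import requests
|
|   resp = requests.post(
|       "http://localhost:9200/chapters/_search",
|       json={
|           "query": {"term": {"mangaId": "bbaa17c4"}},
|           # Return only what the page needs; page filenames,
|           # timestamps, etc. are simply never sent.
|           "_source": ["title", "chapter", "publishAt"],
|           "size": 100,
|       },
|       timeout=10,
|   )
|   for hit in resp.json()["hits"]["hits"]:
|       print(hit["_source"])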
| BizarroLand wrote:
| I have to say I'm glad this is being talked about in a
| public forum. Outsiders rarely get to see brainstorming,
| troubleshooting & group discussion of technological
| issues like this.
|
| Someone who is focused on the performance aspect &
| someone who is focused on stack stability discussing the
| real world input & output of a business system and
| showing why performance & UX are not the only metrics
| that matter is a good thing for us to see.
| kmeisthax wrote:
| This seems less like a performance problem and more of a
| security issue. Especially considering that this is a
| website that hosts unlicensed translations. How much of
| this information is actually intended to be made public?
| krick wrote:
| > Displaying 1 kilobyte of data should take roughly 1
| kilobyte of traffic
|
| Is this to be taken literally? I don't consider myself a
| performance-tuning expert, but I'm not sure how I can make
| something useful out of this advice. Of course, "the less
| you transfer, the better" is an obvious thing to say (a bit
| too obvious to be useful, in fact), but does it really mean
| I should aspire to transfer only what I'm actually going to
| display right now? For example, there is a city
| autocomplete form on the page (well, a couple of thousand
| relatively short entries). In that case I would probably
| consider making 1 request to fetch all these cities (on
| input focus, most likely), instead of making a request to
| the server on every couple of characters you type. Is it
| actually a wrong way of thinking?
| baybal2 wrote:
| > This is one of the core reasons most sites have
| difficulty scaling, because for every kilobyte of content
| output to the screen, they're powering through megabytes or
| even gigabytes of data behind the scenes.
|
| > Can this API query be cut down to match what's displayed
| on the screen? Can it be cached for all users? Can it be
| cached precompressed?
|
| This is why you want to bypass the JS realm (or whatever
| language does the serdes) and send clients JSON or XML
| directly from the database, so the client is only getting
| the data at rest.
| maxk42 wrote:
| > the JS fragments thankfully should load only on first visit
| and then get aggressively cached by DDG/your browser
|
| According to Alexa you have a 46.4% bounce rate. [1]
|
| When 46% of your users aren't coming back, how do 31 round-trips
| to your server for 100% of first-page visitors save anyone time
| or bandwidth? Your pageviews-per-visitor figure is 6.8,
| meaning the 53.6% that stick around view an average of 11.8
| pages each. Even if there are zero subsequent js requests on
| other pages (clicking a random page I see 8) you would be
| generating 31 requests up-front to save 10.8 subsequent
| requests for about half of your users. (And again - in any
| scenario where the number of js fragments transferred on
| subsequent requests >= 1 even this benefit goes out the
| window.) How does that save you or your users bandwidth,
| server load, or other overhead?
|
| The scale is not quite linear, but generally speaking, if you
| get your number of requests down from > 100 to < 5, you'll be
| able to handle around 20x the traffic with the same number of
| web-facing servers. Or alternatively the same amount of
| traffic with around 1 / 20th the servers.
|
| Would that have a material effect on your costs?
|
| [1] https://www.alexa.com/siteinfo/mangadex.org
| tristan9 wrote:
| Definitely needs optimising for user experience indeed!
|
| However the serving of this JS has nearly no cost to us (as
| they are cached at the edge by DDoS-Guard and the frontend
| is otherwise entirely static on our end)
| tyingq wrote:
| >A fresh load of the home page generates over 100 requests.
|
| I see 17 requests, all over either h2 or h3. 4 of them JS, and
| 2 images.
| maxk42 wrote:
| Then you're not doing a fresh load of the page. There are
| over 30 images visible on the front page, so your measure
| doesn't pass the smell test, does it?
| tyingq wrote:
| >Then you're not doing a fresh load of the page
|
| Nope. Different problem.
|
| The article was linked to a page under the domain
| "mangadex.dev".
|
| Without any other context, I had assumed "home page" meant
| http://mangadex.dev , or what I got when clicking "Home" on
| the linked article.
|
| Apparently not.
| radicalbyte wrote:
| Do you manage to get as many buzz-words and OSS products into
| your system as they do? :)
|
| In general, the fewer moving parts you have in a system, the more
| reliable, secure, efficient and cheap the system becomes.
|
| In their case they run a site that is probably under constant
| attack by the "hired goons", so they're going to need to have
| more moving parts than others. Plus they will want to optimise
| for minimal development time (it's a hobby) so just adding
| another tried and trusted system into the stack to do something
| you need makes sense.
| starfallg wrote:
| >In their case they run a site that is probably under
| constant attack by the "hired goons", so they're going to
| need to have more moving parts than others.
|
| That's taken care of by the DDoS-Guard system they placed
| fronting their infrastructure. The design of their system has
| to take this into account, but that is mainly at the IP and DNS
| level. The design of their stack behind the loadbalancer is
| mainly driven by their functional and non-functional
| requirements, rather than by the need to prevent DDoS
| attacks.
| radicalbyte wrote:
| The layering - defence in depth - is very much a security
| consideration. Especially if you're building a pure
| request/response/sync system you need that. Or you decouple
| with a queue for mutations and avoid a lot of issues.
| starfallg wrote:
| That may be true in terms of managing general security,
| especially with regards to the attack surface of the
| solution, but here we are talking about DDoS, which is
| mostly a separate topic and handled on the network level
| (for volumetric attacks) and load-balancer level (for
| non-volumetric attacks) or a combination of both.
| maxk42 wrote:
| lol
|
| > In general the less moving parts you have in a system the
| more reliable, secure, efficient and cheaper the system
| becomes.
|
| 100% agreed. This is not my first high-traffic site, nor even
| the highest. (I built the analytics system for an Alexa
| top-10 site in 2010, reaching some 30 billion writes / day
| off of a mere 14 small ec2 instances.) I've never seen a k8s
| implementation in production that was necessary.
|
| I will note that my Alexa-2k site is also a personal site (no
| revenue) and under constant attack. In fact we frequently
| suffer DDoSes that we don't even notice until reviewing the
| logs later, because the site doesn't show any added latency
| under pressure.
| radicalbyte wrote:
| Interesting, wouldn't mind having a chat outside of HN if
| you're interested (see my profile for mail).
|
| I've spent much of my career working on systems with active
| users from the hundreds to low thousands, but which process
| a huge number of jobs/tasks (50k/sec scale).
|
| It's a totally different kettle of fish, and if I'm totally
| honest I'm shocked at how badly "web" scales and how common
| these naive and super inefficient implementations are
| (hint: my bare-metal server from 2005 was faster than
| expensive cloud VMs).
|
| Recently I've worked on two high-usage systems (one of
| which was "handling" 30k requests/second for the first
| couple of weeks).
| golergka wrote:
| > I've spent much of my career working on systems with
| active users from the hundreds to low thousands, but
| which process a huge number (50k/sec scale) jobs/tasks.
|
| MMO games, by any chance?
| [deleted]
| Folcon wrote:
| Would you mind outlining your approach?
|
| Really interested to see how you think about this sort of
| thing =)...
| maxk42 wrote:
| My approach to what?
| Folcon wrote:
| To architecting a high traffic site =)...
| Cipater wrote:
| He posted a reply to his own comment.
|
| https://news.ycombinator.com/item?id=28443113
| maxk42 wrote:
| Actually, my reply was to Folcon. HN simply doesn't allow
| you to reply to comments beyond a certain depth
| sometimes.
|
| Perhaps mods have the ability to extend this for active
| discussions and that's why I can reply now?
| detaro wrote:
| it's timing based. you can always reply by going to the
| permalink of the comment you want to reply to.
| maxk42 wrote:
| Couldn't reply to this comment - but sure enough, the
| permalink gives me the option. Thank you for the info!
| detaro wrote:
| Yeah, it's a somewhat well-meaning feature (supposed to
| slow down flamewars) that is extremely unintuitive
| maxk42 wrote:
| (1) Simple beats complex.
|
| (2) You can spend weeks building complex infrastructure
| or caching systems only to find out that some fixed C in
| your equation was larger than your overhead savings. In
| other words: Measure everything. In other other words:
| Premature optimization is the root of all evil.
|
| (3) Fewer moving parts equals less overhead. (Again:
| Simple beats complex.) It also makes things simpler to
| reason about. If you can get by without the fancy
| frameworks, VMs, containers, ORM, message queues, etc.
| you'll probably have a more performant system. You need
| to understand what each of those things does and how and
| why you're using them. Which brings me to:
|
| (4) Learn your tools. You can push an incredible amount
| of performance out of MySQL, for instance. If you learn
| to adjust its settings, benchmark different DB engines
| for your application, test different approaches to
| building your schemas, test different queries, and make
| use of tools like the EXPLAIN statement, you'll probably
| never need to do something silly like make half a dozen
| round-trips to the database in a single page load.
|
| (5) Understand your data. Reason about the data you will
| need before you build your application. If you're working
| with an existing application, make sure you are very
| familiar with your application's database schema. Reason
| ahead of time about what requirements you have or will
| have, and which data will be needed simultaneously for
| different operations. Design your database tables in such
| a way as to minimize the number of round-trips you will
| need to make to the database. (My rule of thumb: Try to
| do everything in a single request per page, if possible.
| Two is acceptable. Three is the maximum. If I need to
| make more than three round-trips to the database in a
| single page request, I'm either doing something too
| complex or I seriously need to rethink my schema.)
|
| (6) Networking is slow. Minimize network traversal. Avoid
| relying on third-party APIs where possible when
| performance counts. Prefer running small databases local
| to the web server to large databases that require network
| traversal to reach. This is how I handled 30 billion
| writes / day: 12 web servers with separate MySQL
| instances local to each, sharded on primary key IDs (a
| minimal routing sketch follows after this list). The
| servers continuously exported data to an "aggregation"
| server, which was subsequently copied to another server
| for additional processing. Having the web server and
| database local to the same VM meant they didn't need to
| wait for any network traversal to record their data. I
| could've easily needed several times as many servers if I
| had gone with a traditional cluster due to the additional
| latency. When you need to process 25,000 events in a
| second, every millisecond counts.
|
| (7) Static files beat the hell out of databases for read-
| only performance. (Generally.)
|
| (8) Sometimes you can get things moving even faster by
| storing data in memory instead of on disk.
|
| (9) Reiterating what's in (3): Most web frameworks are
| garbage when it comes to performance. If your framework
| isn't in the top half of the Techempower benchmarks, (or
| higher for performance-critical applications) it's
| probably going to be better for performance to write your
| own code if you understand what you're doing. Link for
| reference: https://www.techempower.com/benchmarks/ Note
| that the Techempower benchmarks themselves can be
| misleading. Many of the best performers are only there
| because of some built-in caching, obscure language hack,
| or standards-breaking corner-cutting. But for the
| frameworks that aren't doing those things, the benchmark
| is solid. Again, make sure you know your tools and _why_
| the benchmark rating is what it is. Note also that some
| entire languages don't really show up in the top half of
| the Techempower benchmarks. Take that into consideration if
| performance is critical to your application.
|
| (10) Most applications don't need great performance.
| Remember that a million hits a day is really just 12 hits
| per second. Of course the reality is that the traffic
| doesn't come in evenly across every second of the day,
| but the point remains: Most applications just don't need
| that much optimization. Just stick with (1) and (2) if
| you're not serving a hundred million hits per day and
| you'll be fine.
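| (For illustration of point 6 above, a minimal Python sketch of
| routing writes by primary-key shard so each web server only
| talks to its own local database. The host addresses and shard
| count are placeholders, not the actual setup described.)
|
|   # Each web/DB pair owns the rows whose id maps to its shard,
|   # so a write never has to leave the box that handles it.
|   SHARDS = [f"10.0.0.{i}" for i in range(1, 13)]  # 12 servers
|
|   def shard_for(record_id: int) -> str:
|       return SHARDS[record_id % len(SHARDS)]
|
|   print(shard_for(12345))  # always maps to the same local DB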
| Folcon wrote:
| Thanks, this is a good list in general of things to think
| about =)...
|
| I've never really applied 9 myself; I've run comparative
| benchmarks a couple of times, but haven't thought about
| using them as a basis for whether to roll my own on
| performance-critical parts.
| probotect0r wrote:
| Can you share how you do logging/monitoring/alerting for
| your site?
| maxk42 wrote:
| Bash scripts and cron. Automatic alerts go out to devs
| via OpsGenie when resource availability drops so we can
| get out ahead of it. 0 seconds of downtime in the past 12
| months.
| arthur_sav wrote:
| > Simple beats complex.
| > Fewer moving parts equals less overhead.
|
| Took me almost a decade to really comprehend this.
|
| I used to include all sorts of libraries, try out all the
| fancy patterns/architectures etc...
|
| After countless hours debugging production issues... the
| best code I've ever written is the code with the fewest
| moving parts. It's easier to debug and the issues are
| predictable.
| kiba wrote:
| "The best part is no part." is an engineering quote I
| heard.
| arethuza wrote:
| I'm sure I've heard something like "engineering is
| solving problems while doing as little new as possible".
| vymague wrote:
| > But for the frameworks that aren't doing those things,
| the benchmark is solid.
|
| Any example of such frameworks?
| manigandham wrote:
| (ASP).NET is solid. Extremely fast, very reliable, and
| highly productive.
|
| https://dotnet.microsoft.com/apps/aspnet
| arethuza wrote:
| "Simple beats complex."
|
| In the very first lecture of the Computer Science degree
| I did in the 1980s the lecturer emphasised KISS, and said
| that while we almost certainly wouldn't believe it at
| first eventually we'd realise that this is the most
| important design principle of all. Probably took me ~15
| years... ;-)
| politelemon wrote:
| Sadly I think this is a lesson that we as an industry
| consistently keep unlearning.
| polote wrote:
| I don't know about you, but they have 42 average page-views per
| visit (HN has 3), so the Alexa rank is going to be biased.
| sxhunga wrote:
| Interesting to know more about news.ycombinator.com !
| jiggawatts wrote:
| What kills me is that this was a rather pedestrian outcome on a
| much cheaper 2-core virtual machine back in 2007 or so.
|
| I easily got 3K requests / sec out of my _laptop_ at the same
| time, and it was not a trivial app!
|
| People's expectations have shifted so much it's absurd. If you
| look at the TechEmpower benchmarks, ordinary VMs can easily push
| 100K requests per second, no sweat, even with managed languages.
|
| Trivial stuff like static content being treated _as static
| content_ (files on the disk!) not as a distributed cache in front
| of a database can do wonders.
|
| Am I just old and jaded?
| [deleted]
| z3t4 wrote:
| 2k sockets on a test bed vs 2k real user requests in production
| is very different. I doubt you ran a top-1000 Alexa site on
| your laptop. Today we need to deal with SSL, which eats into the
| performance budget.
| jiggawatts wrote:
| > SSL which eats from the performance budget.
|
| That was a short-lived thing, and has now become a myth
| perpetuated by companies like Citrix and F5 that sell "SSL
| offload" appliances for $$$.
|
| Have you benchmarked the overhead of TLS?
|
| In my experience, a _single CPU core_ can easily put out
| multiple gigabytes per second of AES-256 (tens of gigabits).
| This benchmark shows 3 GB/s (24 Gbps) for recent AMD CPUs, and
| nearly 40 Gbps per core for an Intel CPU:
| https://calomel.org/aesni_ssl_performance.html
|
| A multi-core server is very unlikely to have more than a 1-5%
| overhead due to TLS. Even connection setup is a minor overhead
| with elliptic curve certificates.
|
| This is thanks to the AES offload instructions, which are
| present in all server CPUs made any time in the last 5-7
| years or so. As long as the modern Galois Counter Mode (GCM)
| is used with AES, performance should be great.
|
| Meanwhile, Citrix ADC v13 with a hardware "SSL offload card"
| _actually slows down_ connections! I had a very hard time
| getting more than 600 Mbps through one. It seems to come down
| to the way the ASIC offload chip is architected: it uses a
| large number of slow cores, a bit like a GPU. This means that
| any one TLS stream will have its bandwidth capped!
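| (For illustration, a quick single-core AES-GCM throughput check
| with the Python "cryptography" package; even with Python
| overhead, one core typically pushes on the order of GB/s thanks
| to AES-NI. "openssl speed -evp aes-256-gcm" gives the raw
| per-core number. The fixed nonce is acceptable for a benchmark
| only, never for real encryption.)
|
|   import time
|   from cryptography.hazmat.primitives.ciphers.aead import AESGCM
|
|   key = AESGCM.generate_key(bit_length=256)
|   aead = AESGCM(key)
|   nonce = b"\x00" * 12         # benchmark only: never reuse nonces
|   chunk = b"\x00" * (1 << 20)  # 1 MiB per encrypt call
|
|   start = time.perf_counter()
|   for _ in range(1024):        # 1 GiB total
|       aead.encrypt(nonce, chunk, None)
|   elapsed = time.perf_counter() - start
|   print(f"~{1.0 / elapsed:.2f} GiB/s of AES-256-GCM on one core")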
| tpetry wrote:
| The problem with these benchmarks is they measure the
| bandwidth you can push through an established TLS
| connection. Try to build 2000 new TLS connections a second
| (yes, many are still active and don't need to be restarted);
| that is the really slow part, not sending the data over
| already-established channels.
| adreamingsoul wrote:
| No. I think you have a healthy perspective and we should all be
| questioning if current trends are beneficial/sustainable.
|
| I haven't read the article, but the headline alone seems
| alarming to me: $1,500 a month is a lot of money for only 2k rps.
| reilly3000 wrote:
| Maybe it's a lot of money just for the web servers, but for
| the entire infrastructure stack it's pretty reasonable IMHO.
| Vosporos wrote:
| Then you should read the article ;)
| rk06 wrote:
| There is more to it than HTTP request/response. Mangadex also
| needs to store a lot of images and distribute them.
| hnlmorg wrote:
| CDNs have already solved this problem and are much cheaper
| than $1500/month.
|
| I've run far more complex sites with much higher traffic
| for less.
| rk06 wrote:
| Mangadex can't use Cloudflare for privacy reasons. They may
| be facing similar issues with other popular CDNs.
|
| I am sure they must be using some kind of CDN; however,
| those options are unlikely to be free.
| hnlmorg wrote:
| Privacy reasons? It's all static content that is publicly
| accessible. I don't understand what the privacy reasons
| could be under this context.
|
| Are they worried about CDNs logging the images their
| visitors access? Seems like an absurd edge case to worry
| about in my opinion.
|
| > _however, those options are unlikely to be free_
|
| I wasn't even talking about free CDNs :)
| vymague wrote:
| I think privacy as in the MangaDex team don't want to get
| sued, so they avoid popular services that would more
| willingly share their identity.
| nrabulinski wrote:
| They're basically hosting illegal content, or at least a
| good chunk of it is copyright-infringing so they cannot
| use cloudflare or any of the other off the shelf
| offerings
| hnlmorg wrote:
| I see. That does complicate things somewhat.
|
| I wonder if there's merit in them approaching studios
| with a proper business plan?
| Jcowell wrote:
| Not with the way the manga, manhwa, and webtoon industry
| is tied up. But I believe that is their end goal
| eventually.
| level3 wrote:
| Hm? There are lots of copyright-infringing sites using
| Cloudflare, and Cloudflare seems pretty content to
| generally ignore infringement notices.
| ricktdotorg wrote:
| Not disputing your statement, just made me laugh a bit
| because almost every single site I visit recently that
| offers links to copyrighted content [stored in file
| lockers] sits behind Cloudflare
| willvarfar wrote:
| Excellent post, good technical content, amazing feat.
|
| That said, I echo that the amazing feat is that they can fit
| modern inefficient tool choices with poor mechanical sympathy
| into that budget. The last decade of web-dev tooling has been
| pushing the TCO of systems through the roof and this post is
| all about how to struggle against that whilst using those
| tools.
|
| If they went old-school they'd get another order of magnitude
| of savings. Many veterans know of systems doing 10x that at a
| tenth of the cost. Remember, C10K was in 1999.
| vymague wrote:
| > If they went old-school they'd get another order of
| magnitude savings. Many veterans know of systems doing 10x
| that in 10x less cost. Remember C10K was in 1999.
|
| How to learn more about the old-school way without getting a
| job related to it? Like, topic or book recommendations.
| thoughtFrame wrote:
| I'm also curious about this, since in many areas of
| computing (not only webdev) the old-school guys take some
| stuff as so obvious that they don't even bother writing
| about it or explaining it beyond "This is obviously
| possible, dude". They know how to achieve this level of
| performance, but for everyone else, we have to cobble
| together fragmented insights. So if anyone out there reads
| this and thinks like the GP, please do write about it ;)
| hsn915 wrote:
| What topics specifically are you interested in? And where
| do all the people like you hang out?
|
| I'm not an old-school guy by any means .. but I might
| have something to contribute.
| vymague wrote:
| > What topics specifically are you interested in?
|
| Well, for example, what's the old-school alternative to
| mangadex's solution?
|
| > And where do all the people like you hang out?
|
| We are here on HN.
| colesantiago wrote:
| Not just you, but if it works for them, that's completely fine.
|
| But there are many ways to achieve 20K RPS without this type of
| architecture, and especially without k8s, for less than $1,500.
| bob1029 wrote:
| >20k RPS.
|
| If this metric is what you are chasing, there are ways to
| reliably break 1 million RPS using a single box if you
| _don't_ play the shiny BS tech game. The moment you involve
| multiple computers and containers, you are typically removed
| from this level of performance. Going from 2,000 to 2,000,000
| RPS (serialized throughput) requires many ideological
| sacrifices.
|
| Mechanical sympathy (ring buffers, batching, minimizing
| latency) can save you unbelievable amounts of margin and time
| when properly utilized.
| nine_k wrote:
| I frankly don't see where containers could lower the
| performance.
|
| Basically a container is a glorified chroot. It has the
| same networking unless you ask for isolation, in which case
| packets have to follow a local (inside-the-host) route. It
| has exactly no CPU or kernel interface penalty.
|
| Maybe you meant container orchestration like k8s, with its
| custom network fabric, etc.
| krageon wrote:
| > I frankly don't see where containers could lower the
| performance.
|
| Have you seen most k8s deployments? It's not the
| containers, it's the thoughtspace that comes with them.
| Even just using bare containers invites a level of
| abstraction and generally comes with a type of developer
| that just isn't desirable.
| bob1029 wrote:
| Even loopback is significantly slower than a direct
| method invocation.
| genewitch wrote:
| In 2011 a company I contracted for was testing some new Dell 1U
| servers with around 1-2TB of RAM. There was a Postgres database
| with 4000qps that could fit into tmpfs, so I restricted Postgres
| to 640Kb of memory and we got replication working; it took about
| 6 hours of babysitting.
|
| We threw the switch and watched as Postgres, with 640Kb of RAM
| and a tmpfs-backed store, proceeded to handle all of the query
| traffic. There were some stored procedures or something that
| were long-querying or whatever - I'm not a DB person at all - so
| we switched back to the regular production server about 8
| minutes later.
|
| Yes, we did it in production.
| winrid wrote:
| Postgres handles low memory situations well. It'll kill
| memory intensive queries before it crashes. I wonder if your
| application was getting a lot of errors back instead of
| successful queries :)
| genewitch wrote:
| the application performed fine, even though we made the
| switch around 15:00 PST. The DBA was concerned because of
| the few long queries.
|
| Obviously the tmpfs was doing the heavy lifting there - and
| if I had to do a postmortem, I'd wager that filling the OS
| caches was the main reason the long queries took so long. We
| didn't do any sort of performance tracing.
|
| The main purpose was to show that these $35k servers could
| essentially replace the older machines if need be, even
| though the old ones had FusionIO. I just removed the
| middleman of the PCIe bus between the application and the
| memory. It was a near constant argument on the floor about
| whether or not we could feasibly switch to SSDs in some
| configuration over spinning rust or even FusionIO, i wanted
| a third option.
|
| Basically, serve out of registered, ECC memory in front,
| replicate to the fusionIO and let those handle the spindled
| backups, which iirc was a pain point.
| bagels wrote:
| For real, 640 kilobits?
| sigstoat wrote:
| K isn't the abbreviation for kilo, so if you're going to
| rag on the fellow for the 'b', then you should at least be
| asking what a Kelvin*bit is.
| iandinwoodie wrote:
| "The binary meaning of the kilobyte for 1024 bytes
| typically uses the symbol KB, with an uppercase letter
| K." [0]
|
| 0. https://en.m.wikipedia.org/wiki/Kilobyte
| jmiserez wrote:
| 640KiB is very little and I'm wondering if it's a typo,
| given that the servers had 1-2TiB available. Postgres 9.0
| released in 2010 already had 32MiB as the default for
| shared_buffers (with a minimum of 128KiB):
| https://www.postgresql.org/docs/9.0/runtime-config-
| resource.... and 8.1 released in 2005 used 8MB
| (1000*8KiB): https://www.postgresql.org/docs/8.1/runtime-
| config-resource....
| sigstoat wrote:
| i interpreted it as "we wanted to turn the shared buffers
| ~off, but in a hilarious way that would suggest to
| someone reading the configuration file that something was
| going on" (bill gates, mumble mumble)
|
| but, wtf do i know, i'm the crazy guy who tries to
| interpret comments generously.
| genewitch wrote:
| Yes, it was a direct reference to Bill Gates "640
| kilobytes is enough for anyone" and i typed the comment
| right before i fell asleep.
| bagels wrote:
| The question was more about the kilo part, even though I
| didn't clarify. Seems orders of magnitude too small?
| [deleted]
| huijzer wrote:
| To add to that, in 2015, a Cloudflare engineer showed that you
| can receive 1 million packets per second
| (https://blog.cloudflare.com/how-to-receive-a-million-
| packets...). Without processing, though.
| kragen wrote:
| I think even a distributed cache in front of a database
| shouldn't have any trouble handling 2000 requests per second.
|
| The issue is not really the number of requests per second,
| probably, but the number of bytes, which they don't talk about
| at all in the article; reading manga with no ads is a pretty
| static kind of application, which could be satisfied amply with
| a web browser or even a much simpler program loading images
| from a filesystem directory.
|
| Valgrind claims httpdito runs a few thousand instructions per
| request, but that's not really accurate; what happens is that
| the kernel is doing all the work. httpdito on Linux can handle
| about 4000 requests per second per core, nearly a million clock
| cycles per request, almost all of which is in the kernel. Of
| course it doesn't ship its logs off to Grafana. In fact, it
| doesn't have logs at all. But it would work fine for reading
| manga.
| antupis wrote:
| Also, those 2000 requests per second have to happen 24/7, not
| only in a quick demo session.
| kragen wrote:
| httpdito is nothing if not consistent in its performance.
| It doesn't have enough state to have many ways to perform
| well at first and then more poorly later, or for that
| matter vice versa. (Not saying it couldn't happen, but it's
| not that likely.) Linux is pretty good about consistent
| performance, too, though it has more state.
| buzer wrote:
| > The issue is not really the number of requests per second,
| probably, but the number of bytes, which they don't talk
| about at all in the article; reading manga with no ads is a
| pretty static kind of application, which could be satisfied
| amply with a web browser or even a much simpler program
| loading images from a filesystem directory.
|
| I assume they are talking about their more dynamic content
| serving in this post (for things like search, tracking which
| chapters are read, new-chapter listings based on what the user
| follows, etc.).
|
| They have a custom CDN that is hosted by volunteers to serve
| the images for the manga pages. They provide some metrics for
| that at https://mangadex.network, there are also some older
| screenshots where they hit 3.2GB/s.
| kragen wrote:
| Interesting! Thanks! It still doesn't sound like the kind
| of thing that would require load balancing, but maybe it
| was easier to write it in Python or PHP or something, and
| that made it so heavy that it did.
| hsn915 wrote:
| > distributed cache in front of a database
|
| Already overkill.
|
| Think smaller. Think simpler.
|
| A single machine serving files directly from the file system
| (yes, from the SSD attached to the machine) will be able to
| handle a _LOT_ more.
| kragen wrote:
| Well, that's what httpdito does: it serves files directly
| from the filesystem. That's why I mentioned it. But, for
| some applications, such as the website you're using right
| now, it's useful to display pages that haven't been
| previously stored in the filesystem.
| snypher wrote:
| >more than 10 million unique monthly visitors
|
| >our ~$1500/month budget
|
| I understand not wanting to show ads, but is there no way for the
| users to contribute to hosting costs?
| [deleted]
| htns wrote:
| They had a bit under $80k in crypto in their list of BTC and
| ETH addresses "leaked" along with the source code when the site
| was hacked earlier this year.
| whateveracct wrote:
| A $5/mo premium plan would break even so quickly
| tommica wrote:
| A premium plan on content that can be considered dubious in a
| copyright context? Seems like a quick way to get shut down.
| hrnn wrote:
| premium plan on a virtual badge, NFT or whatever crap you
| want. content would still be freely available for everybody
| but the infra costs would be a bit less.
| novok wrote:
| Or even a patreon style 'you get nothing but a supporter
| badge' with that kind of traffic levels.
| m45t3r wrote:
| There is MangaDex@Home, where users can donate some disk
| space/bandwidth to help serve (mainly old) manga chapters. It
| does need to be something that is running 24/7 (e.g. not a PC
| that is shut down frequently), so something like a VPS or a
| similar always-on service is recommended.
| AviKav wrote:
| Virtually every chapter is served via MD@H now. Client
| doesn't really need much availability, as long as it can do a
| graceful shutdown. Even in the event of a sudden shutdown,
| the trust penalties are much lower than H@H and in practice
| go away after a trickle of traffic to raise your score
| m45t3r wrote:
| Nice, didn't know about this (there isn't much information
| about MD@H after the rewrite).
|
| BTW, how can I register my VPS on MD@H? Before we had a
| dedicated form on the page to register interest, at least
| after the rewrite I didn't find it. Is it only using
| something like Discord?
| KatKafka wrote:
| We had a dedicated page for signing up on the v3 version
| of the site. Currently, yes, it's via our Discord
| server's MD@H channels.
| franciscop wrote:
| From a quick glance it seems to host obviously copyrighted
| content for free. In some jurisdictions (like Spain) the
| companies would have a hard time in court against the website
| creators, since it's a not-for-profit* website sharing culture.
|
| Now show an ad, or offer premium accounts, and it becomes a for-
| profit endeavour, which is straight jail time. I'm unsure about
| donations.
|
| (Based on previous rulings I followed ~10 years ago, laws might
| have changed IANAL yada yada)
|
| *not for profit != non-profit
| reilly3000 wrote:
| They might have a play at being affiliates for sellers of the
| original material. I suppose a link is an ad, but it's also
| somehow a less dubious way to monetize in my mind.
| deathanatos wrote:
| It's a nice article, I guess, but the site is down (the one
| discussed in the article, not the blog post itself) for me.
| Semaphor wrote:
| It works fine for me: https://mangadex.org/
| bytearray64 wrote:
| You probably have Verizon - they've started null-routing
| traffic to sites "like this".
|
| https://old.reddit.com/r/mangadex/comments/nvj7qf/is_verizon...
| ApolloFortyNine wrote:
| Ah shucks I thought we had mostly avoided that stuff in the
| US.
|
| I'm guessing they're using some old spam IP blocklist,
| though; there are far more obvious piracy sites than a manga
| site. For instance, I can access all the major torrent sites.
| bytearray64 wrote:
| They also block nyaa, an Anime/Manga focused tracker. It's
| not a very aggressive list though, as you're right that
| major torrent sites are still accessible.
| deathanatos wrote:
| Huh, right you are, on both counts, it seems.
|
| That's disappointing. If only I had some choice to ISPs, then
| I could express my disappointment by voting with my wallet...
| bebna wrote:
| Just call their service line often enough and tell them the
| internet isn't working.
| urlgrey wrote:
| I'm amazed that their architecture doesn't include a CDN. These
| days I expect nearly all high traffic websites to make use of a
| CDN for all kinds of content, even content that's not cached.
|
| They cited Cloudflare not being used due to privacy concerns.
| It'd be interesting to hear more about that, as well as why other
| CDNs weren't worth evaluating too.
| uyt wrote:
| What they are doing is unfortunately not legal. There were
| precedents of Cloudflare ratting out manga site operators
| before which have led to arrests [1] (the person who ran
| mangamura got a 3-year sentence and a $650k fine [2]). And at
| some point they were going after MangaDex the same way too
| [3].
|
| A lot of their infrastructure design choices should be viewed
| with OPSEC constraints in mind.
|
| [1] https://torrentfreak.com/japan-pirate-site-traffic-
| collapsed...
|
| [2] https://torrentfreak.com/mangamura-operator-handed-three-
| yea...
|
| [3] https://torrentfreak.com/mangadex-targeted-by-dmca-
| subpoena-...
| Sebguer wrote:
| I think the usual argument re: Cloudflare on the privacy front
| is the fact that they pretty aggressively fingerprint users,
| and will downgrade or block traffic originating from VPNs or
| some countries. This is a natural side effect of those things
| often being tied to abusive traffic, and a lot of it is likely
| configurable (at least on their paid plans) but it often comes
| up around this.
| ev1 wrote:
| It's effectively a warez site. There's a reason why they host
| in the places they do and can't be too picky about providers.
|
| CF will also pass through things like DMCAs easily.
|
| Based on their sidebar, it's probably hosted at Ecatel or
| whatever they are called now (cybercrime host) via Epik as a
| reseller, the provider famous for hosting far-right stuff.
| rovr138 wrote:
| What's the reason behind where they host and having issues
| with providers? I haven't heard this before
|
| Regarding DMCA's, as an entity doing business where they're
| legal, what should they do as a middle man?
| ev1 wrote:
| > Regarding DMCA's, as an entity doing business where
| they're legal, what should they do as a middle man?
|
| Don't use them and instead have your middleman be in a
| country that ignores intellectual property rights and
| copyright?
|
| I'm not saying CF is wrong to pass them through. I'm just
| saying CF is not the right choice for a warez site for
| longevity.
| baybal2 wrote:
| Properly tuned NGINX on a physical server can handle an
| incomparably larger load for static content than some of the
| "cloud" storages around.
|
| The "trick" has really been known for a decade, or more. Have
| as many things static as possible, and only use backend logic
| for the barest minimum.
| sofixa wrote:
| That's the _raison d'etre_ of nginx, so it is performant for
| this kind of thing. However, the advantage of a CDN is that
| they have points of presence around the world, so your user
| in Singapore doesn't have to do a trip around the world to
| get to your nginx on a physical box in Lisbon.
| bawolff wrote:
| What's the benefit of a cdn if nothing is cacheable? Slightly
| lower latency on the tcp/tls handshake? That seems pretty
| insignificant.
| sofixa wrote:
| In their case (manga), seems like the vast majority of the
| content is cacheable.
| manigandham wrote:
| Latency makes a bigger impact on UX than throughput for
| general browsing. A TLS handshake can be multiple roundtrips
| that greatly benefit from lower latency, especially mobile
| devices.
|
| Modern CDNs also provide lots of functionality from security
| (firewall, DDOS) to application delivery (image optimization,
| partial requests).
| ev1 wrote:
| The CDN part is kind of pointless because they can't really
| have nodes in large parts of the western world since.. it's a
| warez site. The CDN providers will get takedowns, requests to
| reveal the backing origin, etc. You can't use a commodity CDN
| provider for this.
| gaudat wrote:
| They do have a crowdsourced CDN called Mangadex@Home. I
| participated in it from last year until the site was hacked.
| The aggregate egress speed was around 10 Gbps.
|
| The NSFW counterpart of MD also has a CDN appropriately named
| Hentai@Home run by volunteers.
|
| These 2 sites are the only ones I know of that roll their own
| CDN for free.
| mdoms wrote:
| I loaded the front page of Mangadex and it made 114 web requests
| including 10 first-party XHR requests, 30(!!!!) Javascript
| resource requests and somehow 4 font requests, without me
| interacting with the page. Clicking one of the titles on the
| front page resulted in nearly 40 additional requests.
|
| Perhaps if you are limited by requests per second you could
| consider how many requests a single user is making per
| interaction, and if this is a reasonable number.
|
| The website is impressively fast though, I'll give you that.
| watermelon0 wrote:
| The frontend framework they use (Nuxt) uses code splitting [1],
| which means that:
|
| - first request is fast, because you only need to download
| chunks required for a single page/controller (and you prefetch
| others in the background)
|
| - changing some parts of the codebase requires re-downloading
| only the affected chunks, instead of the whole bundle
|
| [1] https://www.telerik.com/blogs/what-you-should-know-code-
| spli...
| m45t3r wrote:
| They're probably more limited by bandwidth than requests per
| second, but any way you look at it, the number of requests is
| still impressive considering the budget.
|
| BTW, the site is not just fast: they serve images in high
| quality (the same as the original [1], which can be multiple
| MBs per page [2]) at a pretty impressive speed too.
|
| [1]: before someone asks why they don't optimize the images,
| this is by design since they want to serve high quality images.
| There is an optional toggle to reduce the image size, but this
| is disabled by default.
|
| [2]: for those not familiar, the average number of pages in a
| manga chapter is something like ~20, and a chapter can be read
| in ~5 minutes depending on the density of the text. So you can
| easily consume 50MB+ per chapter.
| [deleted]
| [deleted]
| jhgg wrote:
| They mention a $1,500 budget per month but then omit things
| critical to understanding how they achieve that cost point.
|
| What is actually more interesting is to understand what portion
| is spent on servers versus bandwidth - and what hardware
| configuration they use to host the site. For example, is
| $1,500/mo just paying for colo costs + bandwidth, with already
| owned recycled hardware (think last-gen hardware that you can
| get at steep discounts from eBay / used hardware resellers)?
|
| That would have been way more interesting to know given the blog
| title than the choice of infrastructure software they use.
| chime wrote:
| Not familiar with the project but it is great to see a
| counterpart to over-provisioned enterprise infrastructure. $10 in
| 2021 can do what $100 in 2011 did, what $1000 in 2001 did, and
| that is not solely due to hardware. Well-designed deployments of
| K8s, KVM/LXC, Ceph, and LBs like this project's can handle so
| much more traffic than poorly configured WordPress storefronts.
|
| They're using battle-tested tech from Redis and RabbitMQ to
| Ansible and Grafana. Nothing super fancy, nothing used just for
| the sake of being modern. Not sure how long it took them to end
| up with this architecture but it doesn't look like a new dev
| would have a hard time getting familiar with how everything
| works.
|
| Would definitely like to hear more about their dev environment,
| how it is different from prod, and how they handle the
| differences.
| novok wrote:
| I think enterprises optimize more for business flexibility and
| the ability to A/B test very rapidly vs a finely crafted piece
| of efficiency, for better or worse. The people behind this
| probably do this for their day job, or are teens that are about
| to do it for their day job.
| chime wrote:
| I agree with you. I mostly work in enterprise and understand
| that it has different needs and ROI requirements. However, my
| personal mindset is that computers and networks are really
| really fast now and it's a tragedy that most of these gains
| are nullified due to unoptimized layers of abstraction or
| over-architecting. So it's a welcome sight to read about
| well-designed infrastructure like this.
| novok wrote:
| It happens because businesses are optimizing for resources
| that are ultimately more expensive or slower, namely
| staffing levels and the ability to respond to the market so
| the business can grow or survive longer. Inefficient
| computing architecture as a side effect is a worthwhile
| tradeoff in light of that to them.
|
| But as a craftsman, it is definitely nice :)
| tristan9 wrote:
| > Would definitely like to hear more about their dev
| environment, how it is different from prod, and how they handle
| the differences.
|
| It's honestly quite boringly similar (hence why it's only
| vaguely alluded to in the article)
|
| Take out DDoS-Guard/external LBs (no need for that to be
| public), pick a cheap-o cloud provider to get niceties like
| quick rebuilding with Terraform etc, slap a VPC-like thing on
| it to make a similar private network (do use a different subnet
| so copy-pasting typos across dev and prod are impossible) and
| scale down everything (ES node has 8 CPUs and 24GB RAM in prod?
| It will have to do with 2 vCPUs and 2GB RAM in dev)
|
| One of the annoying things is you do want to test the
| replicated/distributed nature of things, so you can't just
| throw everything on a single-instance-single-host because it's
| dev, otherwise you miss out on a lot of the configuration being
| properly tested, which ends up a bit costlier than necessary
| Shadonototro wrote:
| How can this be legal?
|
| It's basically pirating content
| latch wrote:
| I've done things at scale (5-10K req/s) on a budget ($1000 USD)
| and I've done things at much smaller scales that required a much
| larger budget.
|
| _How_ you hit scale on a budget is one part of the equation. The
| other part is: what you're doing.
|
| Off the top of my head, the "how" will often involve the
| following (just to list a few):
|
| 1 - Baremetal
|
| 2 - Cache
|
| 3 - Denormalize
|
| 4 - Append-only
|
| 5 - Shard
|
| 6 - Performance focused clients/api
|
| 7 - Async / background everything
|
| These strategies work _really_ well for catalog-type systems:
| amazon.com, wiki, shopify, spotify, stackoverflow. The list is
| virtually endless.
|
| But it doesn't take much more complexity for it to become more
| difficult/expensive.
|
| Twitter's a good example. Forget twitter-scale, just imagine
| you've outgrown what 1 single DB server can do, how do you scale?
| You can't shard on the `author_id` because the hot path isn't
| "get all my tweets", the hot path is "get all the tweets of the
| people I follow". If you shard on `author_id`, you now need to
| visit N shards. To optimize the hot path, you need to duplicate
| tweets into each "recipient" shard so that you can do: "select
| tweet from tweets where recipient_id = $1 order by created desc
| limit 50". But this duplication is never going to be cheap (to
| compute or store).
|
| (At twitter's scale, even though it's a simple graph, you have
| the case of people with millions of followers which probably need
| special handling. I assume this involves a server-side merge of
| "tweets from normal people" & RAM["tweets from the popular
| people"].)
| winrid wrote:
| I've heard in a few talks how Twitter engineers have
| accidentally run into OOM problems by loading too big a
| follower graph into memory in application code. I think it's a
| nice reminder that at scale even big companies make the easy
| mistakes and you have to architect for them.
| ignoramous wrote:
| > _Twitter 's a good example._
|
| Mike Cvet's talk about Twitter's fan-in/fan-out problem and its
| solution makes for a fascinating watch: https://www.youtube-
| nocookie.com/embed/WEgCjwyXvwc
| Cipater wrote:
| I appreciate the no-cookie embed.
|
| Learned something new today.
| trampi wrote:
| Reads like a small excerpt out of "Designing Data-Intensive
| Applications" :)
| chairmanwow1 wrote:
| This is an amazing book that improved my effectiveness as an
| engineer by an undefinable amount. Instead of just randomly
| picking components for a cloud application, I learned that I
| could pick the right tools for the job. This book does a
| really good job communicating the trade-offs between
| different designs and tools.
| ignoramous wrote:
| I have always wondered "what next" after having read data-
| intensive. Some suggested looking at research publications
| by Google, Facebook, and Microsoft. What do others
| interested in the field read?
| kalev wrote:
| The 1-7 list you mention definitely deserves its own blog post
| on how to implement these. I'm currently not using any of
| these except 1, and probably don't need the rest for a while
| but I do want to know what I should do when I need it. For
| example: what and how should things be cached? When and how to
| denormalize, why is it needed? Why append-only and how? Never
| 'sharded' before, no idea how that works. Heard some things
| about doing everything async/in the background, but how would
| that work in practice?
| BrentOzar wrote:
| > what and how should things be cached?
|
| If something is read much more frequently than it changes,
| store it client-side, or store it temporarily in an in-
| memory-only, not-persisted-to-disk "persistence" layer like
| Redis.
|
| For example, if you're running an online store, your product
| list doesn't change all that often, but it's queried
| constantly. The single source of truth lives in a relational
| database, but when your app needs to fetch the list of
| products, it should first check the caching layer to see if
| it's available there. If not, fetch it from the database, but
| then write it into the cache so that it's available more
| quickly the next time you need it.
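|
| A minimal sketch of that cache-aside pattern in Go with the
| go-redis client (the key name, TTL and loadProductsFromDB are
| placeholders, not anything the site actually uses):
|
|       package example
|
|       import (
|           "context"
|           "time"
|
|           "github.com/redis/go-redis/v9"
|       )
|
|       // getProducts checks the cache first and only falls back
|       // to the database on a miss, writing the result back with
|       // a TTL so the next read is served from memory.
|       func getProducts(ctx context.Context,
|           rdb *redis.Client) (string, error) {
|           val, err := rdb.Get(ctx, "product_list").Result()
|           if err == nil {
|               return val, nil // cache hit
|           }
|           if err != redis.Nil {
|               return "", err // a real error, not just a miss
|           }
|           val, err = loadProductsFromDB(ctx)
|           if err != nil {
|               return "", err
|           }
|           // slightly stale data is fine for a product list
|           err = rdb.Set(ctx, "product_list", val,
|               10*time.Minute).Err()
|           return val, err
|       }
|
|       // loadProductsFromDB stands in for the real relational
|       // query against the source of truth.
|       func loadProductsFromDB(ctx context.Context) (string, error) {
|           return `[{"id":1,"name":"example"}]`, nil
|       }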
|
| > When and how to denormalize, why is it needed?
|
| When you need to join several tables together in order to
| retrieve a result set, and especially when you need to do
| grouping to get the result set, and the retrieval & grouping
| is presenting a performance problem, then pre-bake that data
| on a regular basis, flattening it out into a table optimized
| for read performance.
|
| Again with the online store example, let's say you want to
| show the 10 most popular products, with the average review
| score for each product. As your store grows and you have
| millions of reviews, you don't really want to calculate that
| data every time the web page renders. You would build a
| simpler table that just has the top 10 products, names, IDs,
| average rating, etc. Rendering the page becomes much more
| simple because you can just fetch that list from the table.
| If the average review counts are slightly out of date by a
| day or two, it doesn't really matter.
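|
| A sketch of that "pre-bake it on a schedule" idea in Go + SQL
| (the table and column names are invented; you'd call this from
| a cron job or background worker, not from the page render):
|
|       package example
|
|       import "database/sql"
|
|       // refreshTopProducts rebuilds a small, read-optimized
|       // table from the expensive join/group query, so the page
|       // just does "select * from top_products order by rank".
|       func refreshTopProducts(db *sql.DB) error {
|           if _, err := db.Exec(`truncate top_products`); err != nil {
|               return err
|           }
|           _, err := db.Exec(`
|               insert into top_products
|                   (rank, product_id, name, avg_rating)
|               select row_number() over (order by count(*) desc),
|                      p.id, p.name, avg(r.rating)
|               from products p
|               join reviews r on r.product_id = p.id
|               group by p.id, p.name
|               order by count(*) desc
|               limit 10`)
|           return err
|       }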
|
| > Why append-only and how?
|
| If you have a lot of users fighting over the same row, trying
| to update it, you can run into blocking problems. Consider
| just storing new versions of rows.
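|
| In code, the append-only version of "update the row" might look
| roughly like this (hypothetical schema, Go + SQL):
|
|       package example
|
|       import "database/sql"
|
|       // Instead of fighting over one row with
|       // "update accounts set balance = ...", append a new
|       // version and always read the most recent one.
|       func appendBalance(db *sql.DB, accountID, balance int64) error {
|           _, err := db.Exec(
|               `insert into account_versions
|                  (account_id, balance, created)
|                values ($1, $2, now())`, accountID, balance)
|           return err
|       }
|
|       func currentBalance(db *sql.DB, accountID int64) (int64, error) {
|           var b int64
|           err := db.QueryRow(
|               `select balance from account_versions
|                where account_id = $1
|                order by created desc limit 1`, accountID).Scan(&b)
|           return b, err
|       }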
|
| But now we're starting to get into the much more challenging
| things that require big application code changes - that's why
| the grandparent post listed 'em in this order. If you do the
| first two things I cover above there, you can go a long,
| long, long way.
| toast0 wrote:
| > Never 'sharded' before, no idea how that works.
|
| Sharding sucks, but if your database can't fit on a single
| machine anymore, you do what you've got to do. The basic idea
| is instead of everything in one database on one machine (or
| well redundant group of machines anyway), you have some
| method to decide for a given key what database machine will
| have the data. Managing the split of data across different
| machines is, of course, tricky in practice; especially if you
| need to change the distribution in the future.
|
| OTOH, Supermicro sells dual processor servers that go up to 8
| TB of ram now; you can fit a lot of database in 8 TB of ram,
| and if you don't keep the whole thing in ram, you can index a
| ton of data with 8 TB of ram, which means sharding can wait.
| In contrast, eBay had to shard because a Sun e10k, where they
| ran Oracle, could only go to 64 GB of ram, and they had no
| choice but to break up into multiple databases.
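|
| The "some method to decide for a given key" part can start out
| as simple as hashing the key into a fixed number of shards (a
| Go sketch; the hard part this glosses over is moving data when
| you change the number of shards later):
|
|       package example
|
|       import (
|           "database/sql"
|           "hash/fnv"
|       )
|
|       // shardFor picks which database holds the data for a key.
|       // A directory service or consistent hashing is the usual
|       // next step once data has to move between shards.
|       func shardFor(shards []*sql.DB, key string) *sql.DB {
|           h := fnv.New32a()
|           h.Write([]byte(key))
|           return shards[h.Sum32()%uint32(len(shards))]
|       }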
| bigiain wrote:
| > you have some method to decide for a given key what
| database machine will have the data
|
| Super simple example: splitting the phone book into two
| volumes, A-K and L-Z. (Hmmmm, is a "phonebook" a thing that
| typical HN readers remember?)
|
| > you can fit a lot of database in 8 TB of ram, and if you
| don't keep the whole thing in ram, you can index a ton of
| data with 8 TB of ram, which means sharding can wait.
|
| For almost everyone, sharding can wait until after the
| business doesn't need it any more. FAANG need to shard.
| Maybe a few thousand other companies need to shard. I
| suspect way way more businesses start sharding when
| realistically spending more on suitable hardware would
| easily cover the next two orders of magnitude of growth.
|
| One of these boxes maxed out will give you a few TB of ram,
| 24 cpu cores, and 24x16TB NVMe drives which gives you
| 380-ish TB of fairly fast database - for around $135k, and
| you'd want two for redundancy. So maybe 12 months worth of
| a senior engineer's time.
|
| https://www.broadberry.com/performance-storage-
| servers/cyber...
| Zababa wrote:
| > So maybe 12 months worth of a senior engineer's time.
|
| In America. Where salaries are 2-3 times lower, people spend
| more time to use less hardware.
| toast0 wrote:
| Sharding does take more time, but it doesn't save that
| much in hardware costs. Maybe you can save money with two
| 4TB ram servers vs one 8TB ram server, because the
| highest density ram tends to cost more per byte, but you
| also had to buy a whole second system. And that second
| system has follow on costs, now you're using more power,
| and twice the switch ports, etc.
|
| There's also a price breakpoint for single socket vs dual
| socket. Or four vs two, if you really want to spend
| money. My feeling is currently, single socket Epyc looks
| nice if you don't use a ton of ram, but dual socket is
| still decently affordable if you need more cores or more
| ram, and probably for Intel servers; quad socket adds a lot
| of expense and probably isn't worth it.
|
| Of course, if time is cheap and hardware isn't, you can
| spend more time on reducing data size, profiling to find
| optimizations, etc.
| Zababa wrote:
| Fair points, I'm just trying to push back a bit against
| "optimizing anything is useless since the main cost is
| engineering and not hardware", since this situation
| depends on the local salaries, and in low-income countries
| the opposite can be true.
| pedrosorio wrote:
| As a sibling comment mentioned, read DDIA:
| https://dataintensive.net/
| latch wrote:
| It's hard to answer this in general. Most out-of-the-box
| scaling solutions have to be generic, so they lean on
| distribution/clustering (e.g., more than one node +
| coordination), so they're expensive.
|
| Consider something like an amazon product page. It's mostly
| static. You can cache the "product", and calculate most of
| the "dynamic" parts in the background periodically (e.g.,
| recommendation, suggestions) and serve it up as static
| content. For the truly dynamic/personalized parts (e.g.,
| previously purchased) you can load this separately (either as a
| separate call from the client or let the server piece all the
| parts together for the client). This personalized stuff is
| user-specific, so [very naively]:
|
|         conn = connections[hash(user_id) % number_of_db_servers]
|         conn.row("select last_bought from user_purchases
|                   where user_id = $1 and product_id = $2",
|                  user_id, product_id)
|
| Note that this is also a denormalization compared to:
|
|         select max(o.purchase_date)
|         from order o
|         join order_items oi on o.id = oi.order_id
|         where o.user_id = $1 and oi.product_id = $2
|
| Anyways, I'd start with #7. I'd add RabbitMQ into your stack
| and start using it as a job queue (e.g. sending forgot-password
| emails). Then I'd expand it to track changes in your data:
| write to "v1.user.create" with the user object in the payload
| (or just user id, both approaches are popular) when a user is
| created. It should let you decouple some of the logic you
| might have that's being executed sequentially on the http
| request, making it easier to test, change and expand. Though
| it does add a lot of operational complexity and stuff that
| can go wrong, so I wouldn't do it unless you need it or want
| to play with it. If nothing else, you'll get more comfortable
| with at-least-once, idempotency and poison messages, which
| are pretty important concepts. (to make the write to the DB
| transactionally safe with the write to the queue, look up
| "transactional outbox pattern").
| 29athrowaway wrote:
| Try to convert as much content as you can into static content,
| and serve it via CDN. Then, use your servers only for dynamic
| stuff.
|
| Also, put the browser to work for you, caching via Cache-
| Control, ETag, etc. Only then, optimize your server...
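|
| A minimal Go example of the browser-caching part (the content
| and the ETag derivation are simplified):
|
|       package main
|
|       import (
|           "crypto/sha1"
|           "fmt"
|           "net/http"
|       )
|
|       func main() {
|           page := []byte("<html>mostly static page</html>")
|           etag := fmt.Sprintf(`"%x"`, sha1.Sum(page))
|
|           http.HandleFunc("/", func(w http.ResponseWriter,
|               r *http.Request) {
|               // let browsers and CDNs reuse this for an hour
|               w.Header().Set("Cache-Control", "public, max-age=3600")
|               w.Header().Set("ETag", etag)
|               // client already has this version: skip the body
|               if r.Header.Get("If-None-Match") == etag {
|                   w.WriteHeader(http.StatusNotModified)
|                   return
|               }
|               w.Write(page)
|           })
|           http.ListenAndServe(":8080", nil)
|       }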
| Ginden wrote:
| I would like to note that many of these techniques can incur a
| significant cost in developer or sysadmin time.
| jprupp wrote:
| This is complexity for complexity's sake. Pay no attention to the
| disclaimer at the start of the article. They threw every
| buzzword-heavy bit of tech they could find at it, creating a
| Frankenstein monster.
| sofixa wrote:
| Completely disagree. How would you do it in a simpler way,
| while keeping the features like redundancy ( including
| storage), logs, metrics, etc?
| pahae wrote:
| Looking at their diagrams it seems that the k8s cluster
| exists solely to handle their monitoring and logging needs
| which would be extreme overkill, especially since 18k
| metrics/samples and 7k logs per second are nothing. Plus you
| now suddenly need an S3-compatible storage backend for all
| your logs and metrics. Good thing Ceph comes 'free' with
| Proxmox, I guess.
|
| Deploying an instance of Prometheus with _every_ host is also
| unusual, to say the least, and I don't quite understand their
| comment on that. If you don't like a pull-based architecture
| (which is a valid point) why use one at all!? There are many
| push-based setups out there that are simpler to set up and
| less complex.
| hsn915 wrote:
| I don't understand. Why is 2k requests/sec supposed to be
| massive?
|
| Try this yourself: write a simple web server in Go, host it on a
| cheap VPS provider, let's say on the option that costs $20/mo.
| Your website will be able to handle more than 1k requests/s with
| hardly any resource usage.
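|
| (For anyone who wants to try: a minimal version of such a server
| is only a few lines; the endpoint and payload are placeholders.)
|
|       package main
|
|       import (
|           "encoding/json"
|           "log"
|           "net/http"
|       )
|
|       func main() {
|           http.HandleFunc("/api/manga",
|               func(w http.ResponseWriter, r *http.Request) {
|                   // serve a bit of metadata as JSON; a handler
|                   // like this is rarely the bottleneck on a VPS
|                   w.Header().Set("Content-Type", "application/json")
|                   json.NewEncoder(w).Encode(map[string]string{
|                       "title":  "Example Manga",
|                       "status": "ongoing",
|                   })
|               })
|           log.Fatal(http.ListenAndServe(":8080", nil))
|       }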
|
| ok, let's assume you're doing some complicated things.
|
| So what? You can scale vertically, upgrade to the $120/mo server.
| Your website now should be able to comfortably handle 5k req/s
|
| Looking at the website itself, mangadex.org, it doesn't even host
| the manga itself. The whole website is just an index that links
| to manga on external websites. All you are doing is storing
| metadata and displaying it as a webpage. The easiest problem on
| the web.
|
| So, I really don't understand the premise behind the whole post.
|
| The problem statement is:
|
| > In practice, we currently see peaks of above 2000 requests
| every single second during prime time.
|
| This is great in terms of success as a website, but it's
| underwhelming in terms of describing a technical problem.
| true_religion wrote:
| They also host the manga. It's not just a link farm. Because
| they host... that's why they use Ceph.
|
| Their goal is for scanlators to have a place to post their new
| translated manga, rather than always linking it off from some
| Wordpress instance.
| AviKav wrote:
| The only releases that link to external websites are the ones
| from sites such as MangaPlus and BiliBili (And delayed releases
| if you count those)
| tristan9 wrote:
| > This is great in terms of success as a website, but it's
| underwhelming in terms of describing a technical problem.
|
| A bit of an intro punchline, even though I agree it admittedly
| doesn't say much on its own :)
|
| Fwiw most of the work comes from the fact that there's little
| "static" traffic going on -- images and cacheable responses are
| not very CPU intensive to serve -- but what isn't static (which
| is a good chunk of it) is more problematic. More to come on
| that.
| Tenoke wrote:
| >Looking at the website itself, mangadex.org, it doesn't even
| host the manga itself.
|
| They do seem to. Clicking on a random manga on there the images
| are hosted on their server[0]. Also I guess some of those are
| much bigger images which is less trivial to serve at that rate
| than a 10kb static page.
|
| 0.
| blob:https://mangadex.org/e78bd61a-e761-4a73-a27c-5f58394e7ea4
| akx wrote:
| Blob links are scoped to your browser tab, they're not real
| internet URLs.
| slightwinder wrote:
| You ignore the weight of requests and the general situation of
| this project. This is not your average mommy-blog that does not
| care much about how much downtime it has. This is a website with
| illegal content, under constant attack, with some pretty
| dynamic content on top and likely the main goal of satisfying
| their community. So most of their budget will go to security
| and redundancy, to protect themselves and allow high
| uptime.
|
| Where you can use 1 server, they will need to have something
| around 20 servers. Where you can use a cheap VPS provider, they
| must use an expensive shady provider who will take the heat of
| legal attacks. And so on and on... because of their situation
| they have a bunch more budget-eating requirements than
| your average website, leading to a rather heavy, complex and
| thus expensive architecture.
|
| Surely there is still room for optimization, but it seems this
| is a rather new redesign from scratch(?), so the details need
| time.
| ctvo wrote:
| > Try this yourself: write a simple web server in Go, host it
| on a cheap VPS provider, let's say at the option that costs
| $20/mo. Your website will be able to handle more than 1k/s
| requests with hardly any resource usage.
|
| These people have never heard of Go, obviously. The likely
| scenario is not that you haven't fully understood their
| constraints or requirements, it's that you're just smarter than
| they are.
|
| > So what? You can scale vertically, upgrade to the $120/mo
| server. Your website now should be able to comfortably handle
| 5k req/s
|
| > Looking at the website itself, mangadex.org, it doesn't even
| host the manga itself. The whole website is just an index that
| links to manga on external websites. All you are doing is
| storing metadata and displaying it as a webpage. The easiest
| problem on the web.
|
| Take that order of magnitude cheaper, single VPS server
| solution you're proposing and build something with it. Sounds
| like you'd make a lot of money. There has to be a business idea
| around "storing metadata and displaying it as a webpage"
| somewhere? Easiest problem on the web.
|
| The peanut gallery at HN is out of control. People who don't do
| / build explaining to the people who do how easy, simple,
| better their solutions would be.
| krageon wrote:
| > People who don't do / build
|
| I can and do frequently advise on certain topics in comments
| specifically because I do build and can in fact speak of such
| topics authoritatively. Isn't that what this website is for?
|
| That said, the post you are replying to is perhaps overly
| dismissive of the criteria that this website operates under.
| Other comment chains have some really good advice though.
| manigandham wrote:
| There are plenty of people who build here on HN (more than
| most other sites) and the requirements are pretty clearly
| described in the article.
|
| While it's not as simple as a Go program on a VPS, there is
| certainly a lot of unnecessary overhead here. I think you
| underestimate just how much poor and wasteful engineering
| there is out there.
| ctvo wrote:
| > While it's not as simple as a Go program on a VPS, there
| is certainly a lot of unnecessary overhead here. I think
| you underestimate just how much poor and wasteful
| engineering there is out there.
|
| I don't underestimate poor and wasteful engineering at
| all, but that's not what I saw in the article.
|
| Serving traffic is a single element of their design. They
| also designed for security, redundancy, and observability.
| All with their own solutions because using a service or a
| cloud provider would be too costly. With that in mind, it's
| not a charitable view to think they didn't explore low-hanging
| fruit like "make the server in Go". If you think
| you can do better, detail in depth how and solve all of
| their requirements vs. the single piece you're familiar
| with.
|
| And if you can do the above holistically, for an order of
| magnitude below their costs, it sounds like I need to get
| in touch to throw money at you.
| hsn915 wrote:
| If you have a lot of extra money to throw I'd be happy to
| oblige.
| manigandham wrote:
| My background is in adtech, which is a unique mix of
| massive scale, dynamic logic, strict latency
| requirements, and geographical distribution. I've built
| complete ad platforms by myself for 3 different companies
| now so I can confidently say that this is not a difficult
| scenario. It's a read-heavy content site with very
| little interactivity or complexity to each page and can
| be made much simpler, faster and cheaper.
|
| > " _detail in depth how_ "
|
| This thing seems to be little more than a very complex
| API and SPA sitting on top of Elasticsearch. These
| frontend/backend sites are almost always a poor choice
| compared to a simple server-side framework that just
| generates pages. ES itself is probably unnecessary
| depending on the requirements of their search (it doesn't
| seem to be actual full text indexing of the content but
| just the metadata). The security and observability also
| tend to be a problem of their own making and a symptom
| of too much complexity.
| ctvo wrote:
| > _My background is in adtech, which is a unique mix of
| massive scale, dynamic logic, strict latency
| requirements, and geographical distribution. I've built
| complete ad platforms by myself for 3 different companies
| now so I can confidently say that this is not a difficult
| scenario. It's a read-heavy content site with very
| little interactivity or complexity to each page and can
| be made much simpler, faster and cheaper._
|
| I don't dispute this or your credentials. You've built
| critical systems in a space where it was a _core_ of the
| business. If given time, and resources, I have no doubt
| you could build a custom solution to their problem that
| was more efficient.
|
| Unstated in this is the type of business MangaDex is,
| which I have the following assumptions about. I don't
| think it's unfair to assume that we're mostly on the same
| page here:
|
| - Small to mid size, at most
|
| - Small engineering team. Need to develop, deploy,
| support, and maintain solutions.
|
| - Lacks deep systems expertise, or is unable to attract
| talent that has that expertise ($)
|
| These characteristics are very common in our space. To
| solve their technical problems, most of the time, they
| reach for an open source solution (after examining the
| alternatives like a service).
|
| Now the question is given those constraints, and their
| other business requirements, how do they best optimize
| for dimensions they care about? Everything is a trade-
| off. Everyone who builds knows this. It's unkind to
| pretend this is a purely technical exercise. And after
| reading their article, it's obvious they know _some_ of the
| trade-offs they're making, so it's unkind to suggest a
| naive solution that does nothing but make you feel
| smarter. I'm not saying you did the above, but some of
| these comments are outrageous.
| hsn915 wrote:
| > The likely scenario is not that you haven't fully
| understood their constraints or requirements, it's that
| you're just smarter than they are.
|
| I never claimed to be smarter. I just understand some things
| that I noticed a lot of people in the industry don't
| understand.
|
| My understanding is not even that great.
|
| But still, this is just one example that I keep running into
| over and over and over:
|
| People opting for a complicated infrastructure setup because
| that's what they think you should do.
|
| No one showed them how to make a stable, reliable website that
| just runs on a single machine and handles thousands of
| concurrent connections.
|
| It's not hard. It's just that they've never seen it and
| assume it's basically impossible.
|
| There are areas of computing that I feel the same way
| about. For example, before Casey Muratori demoed his refterm
| implementation, I had no idea that it was possible to render
| a terminal at thousands of frames per second. I just assumed
| such a feat was technically impossible. Partly because no one
| has done it. But then he did it, and I was blown away.
|
| > Take that order of magnitude cheaper, single VPS server
| solution you're proposing and build something with it. Sounds
| like you'd make a lot of money.
|
| Building something and making money out of it are not the
| same thing. But thanks for the advice. I'm in the process of
| trying. I know for sure I can build the thing, but I don't
| know if it will make any money. We will see.
|
| > People who don't do / build explaining to the people who do
| how easy, simple, better their solutions would be.
|
| I do and have done.
|
| This kind of advice is exactly the kind of thing I know how
| to do because I have done it in the past using my trivial
| setup of a single process running on a cheap VPS. And I have
| also seen other teams struggle to get some feature nearly
| half-working on a complicated infrastructure setup with AWS
| and all the other buzzwords: Kibana, Elastic Search, Dynamo
| DB, Ansible, Terraform, Kubernetes ... what else? I can't
| even keep track of all this stuff that everyone keeps talking
| about even though hardly anyone needs at all.
|
| I've seen 4 or 5 companies try to build their service using
| this kind of setup, with the proposed advantage of
| "horizontal" and "auto" scaling. And you know what? They ALL
| struggled with poor performance, _ALL_ _THE_ _TIME_. It's
| really sad.
| colesantiago wrote:
| I agree, just had to read the article again, and took it as a
| fancy way of wasting money really.
| [deleted]
| robertwt7 wrote:
| I have nothing but respect for the whole team. Dedicating their
| time to build everything from scratch, not to mention that they
| maintain everything for free... It's a cool project, not sure if
| there's a way for anyone to contribute.
|
| I'll join the Discord after work to see if they need an extra
| hand.
|
| Gee, how do these people find other people online to work on all
| of the cool projects? I would love to join rather than playing
| games after WFH on the same PC over and over again lol
| codewithcheese wrote:
| Find cool project. Contribute. :)
| robertwt7 wrote:
| I do on some open source projects on GitHub. Sorry, what I
| meant is not just open source projects but working
| products like this, driven by volunteers / teams like theirs.
| IncRnd wrote:
| Okay, but isn't most of their content stolen? Why would you
| want to contribute to that?
| hwers wrote:
| Just curious if anyone reading this knows the answer: Would
| it be illegal to contribute man-hours on e.g. implementing
| features or fixing bugs on a project like this, or does that
| only apply to whoever actually hosts the content?
| Hamuko wrote:
| MPA tried to get the source code for Nyaa.si removed from
| GitHub because the "Repository hosts and offers for
| download the Project, which, when downloaded, provides the
| downloader everything necessary to launch and host a
| "clone" infringing website identical to Nyaa.si (and, thus,
| engage in massive infringement of copyrighted motion
| pictures and television shows)".
|
| It was a completely retarded play on MPA's part and they
| only managed to get the repo down for days until GitHub
| restored it even without hearing from the repo owners. So
| really they only brought about some minor nuisance
| alongside a bunch of headlines to advertise Nyaa.si for the
| rest of the world.
|
| https://torrentfreak.com/mpa-takes-down-nyaa-github-
| reposito...
|
| https://torrentfreak.com/github-restores-nyaa-repository-
| as-...
| kmeisthax wrote:
| A good lawyer would probably say something like "it
| depends".
|
| It's entirely possible for a copyright owner to construe
| some kind of secondary liability based on your conduct,
| even if the underlying software is legal. This is how they
| ultimately got Grokster, for example - even if the software
| was legal, advertising its use for copyright infringement
| makes you liable for the infringement. I could also see
| someone alleging contributory liability for, say,
| implementing features of the software that have no non-
| infringing uses. Even if that turned out to ultimately not
| be illegal, that would be at the end of a long, expensive,
| and fruitless legal defense that would drain your finances.
|
| In other words, "chilling effects dominate".
| KingOfCoders wrote:
| Yes of course it is stolen. And people claiming otherwise are
| the same people who come here and ask "What can I do, some
| Chinese company ripped off my website?!?!?!"
| cyborgx7 wrote:
| >And people claiming otherwise are the same people who come
| here and ask "What can I do, some Chinese company ripped of
| my website?!?!?!"
|
| Something you made up in your head with literally not a
| single shred of evidence.
| dinobones wrote:
| Maybe because they enjoy the interesting domain and
| challenges of the area, look at a project like Dolphin for
| example.
|
| Also, some people hold the view that things like information,
| media, code can not be "stolen" in the traditional sense, so
| that further reduces any qualms about associating themselves
| with it.
| cyborgx7 wrote:
| No, intellectual "property" can not be stolen. You are
| thinking of copyright infringement.
| [deleted]
| neonbones wrote:
| The world of scanlations is always on edge. Usually, when
| publishers announce an official translation of a manga title,
| fans drop their translations of that title. It's not rare for
| publishers to hire the fans who were previously translating a
| title for free as the official team.
|
| To be more precise, the real reason why such sites stay alive
| is that they delete titles that get licensed in Europe and
| the USA. Still, publishers can measure the popularity of
| titles and buy the legal rights to publish them because they
| are popular enough. It's harder to find manga "raws" than
| translated versions.
|
| And because of that, they're not treated as 100% "illegal" in
| the western world, and Asian companies are not so interested in
| fighting scanlations because they need to combat piracy in
| their own part of the world.
| lifthrasiir wrote:
| Heck no. As per the Berne Convention they are 100% illegal
| even in the western world and can only survive due to the
| neglect or lack of legal resources---I have seen multiple
| cases where artists were well aware of scanlations but
| couldn't fight against them because of that. A legal way to
| do scanlation would always be welcome (and there have been
| varying degrees of success in other areas), but it is just
| wrong to claim that they are somehow legitimate at all.
| Hamuko wrote:
| Depends on what you consider "stolen". In most cases, the
| manga that is available is translated and edited by fans to
| make it accessible to English-speakers when the IP owners do
| not see a reason to do it themselves. The amount of manga
| that actually get official English releases is very tiny and
| western licensing companies do not have many incentives to
| start picking up obscure manga that no one without the
| ability to read Japanese have heard of. They're much better
| off going after manga that have already been made popular by
| fan-translated manga, or have some other property that has
| caught traction (for example manga with an anime adaptation
| that has official or unofficial subtitles).
| IncRnd wrote:
| It's what most of the world considers stolen.
|
| _Scanlations are often viewed by fans as the only way to
| read comics that have not been licensed for release in
| their area. However, according to international copyright
| law, such as the Berne Convention, scanlations are
| illegal._ [1]
|
| This is a snippet about the Berne Convention:
|
| _The Berne Convention for the Protection of Literary and
| Artistic Works, usually known as the Berne Convention, is
| an international agreement governing copyright, which was
| first accepted in Berne, Switzerland, in 1886. The Berne
| Convention has 179 contracting parties, most of which are
| parties to the Paris Act of 1971.
|
| The Berne Convention formally mandated several aspects of
| modern copyright law; it introduced the concept that a
| copyright exists the moment a work is "fixed", rather than
| requiring registration. It also enforces a requirement that
| countries recognize copyrights held by the citizens of all
| other parties to the convention._ [2] [1]
| https://en.wikipedia.org/wiki/Scanlation#Legal_action
| [2] https://en.wikipedia.org/wiki/Berne_Convention
| angarg12 wrote:
| > The only missing bit would be the ability to replicate
| production traffic, as some bugs only happen under very high
| traffic by a large number of concurrent users. This is however at
| best difficult or nearly impossible to do.
|
| Not sure if I'm missing something here. Surely you could sample
| some prod traffic and then replay it with one of the many load
| test tools out there. You might lose the geographical
| distribution, but load testing a web server at 2k TPS sounds a
| bit trivial.
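|
| A rough sketch of the replay part in Go (urls.txt would be a
| sample of request URLs pulled from the access logs; the
| concurrency number is arbitrary):
|
|       package main
|
|       import (
|           "bufio"
|           "log"
|           "net/http"
|           "os"
|           "sync"
|       )
|
|       func main() {
|           f, err := os.Open("urls.txt") // sampled prod requests
|           if err != nil {
|               log.Fatal(err)
|           }
|           defer f.Close()
|
|           sem := make(chan struct{}, 200) // ~200 in flight
|           var wg sync.WaitGroup
|           sc := bufio.NewScanner(f)
|           for sc.Scan() {
|               url := sc.Text()
|               sem <- struct{}{}
|               wg.Add(1)
|               go func() {
|                   defer wg.Done()
|                   defer func() { <-sem }()
|                   resp, err := http.Get(url)
|                   if err != nil {
|                       log.Println(err)
|                       return
|                   }
|                   resp.Body.Close()
|               }()
|           }
|           wg.Wait()
|       }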
| holoduke wrote:
| I am running an app with 10,000 incoming req/s on average. It's
| running on eight 8-core Hetzner VMs. Most requests are static
| data calls like images, JSON and text. About 5% is MySQL and
| other IO operations. I pay about 300 euros a month for this
| setup. Quite happy with it.
| TekMol wrote:
| My cheap $20/month VPS serves tens of thousands of users per day
| without breaking much of a sweat. Using a good old LAMP stack
| (Linux, Apache, MariaDB, PHP).
|
| I don't know how many requests per second it can handle.
|
| Trying a guess via curl:
|
|         time curl --insecure --header 'Host: www.mysite.com' \
|             https://127.0.0.1 > test
|
| This gives me 0.03s
|
| So it could handle about 30 requests per second? Or 30x the
| number of CPUs? What do you guys think?
| dharmab wrote:
| Does it serve 20-40 hi resolution images and uploads per user?
| TekMol wrote:
| I wanted to start a discussion about how to estimate the
| number of requests a given server can handle per second. So
| when I read "x requests/s" I can put that into perspective.
|
| But it seems you think I wanted to start a dick measuring
| contest?
|
| If your question is genuine: I would serve images via a CDN.
| The above timing is for assembling a page by doing an auth
| check, a bunch of database queries and templating the result.
| lostmsu wrote:
| I am not sure how to interpret this para:
|
| > In practice, we currently see peaks of above 2000 requests
| every single second during prime time. That is multiple
| billions of requests per month, or more than 10 million
| unique monthly visitors. And all of this before actually
| serving images.
|
| If I am reading that correctly, 2000 req/s does not include
| images, and it's unclear whether the $1500/month does.
| cinntaile wrote:
| I'm pretty sure that includes images, that's why people
| visit the site. Prime time happens when a very popular
| manga gets released at around the same time every week.
| Hamuko wrote:
| Hosting static files isn't really that hard. I used to host a
| website that at its best served around 1000 GB of video
| content in 24 hours. Of course, it wasn't the fastest without
| a CDN but it was just 25 EUR/month.
| blntechie wrote:
| I guess you basically run a load test with a randomized or
| usage-weighted list of API endpoints for an increasing number of
| synthetic users and see when things start breaking. Many free
| tools help run these tests from even your laptop.
| [deleted]
| rhines wrote:
| You need to do load testing to determine this - a request's
| time includes many delays that are not related to the work the
| server does, and thus it's not as simple as 1/0.03 - it's
| possible that 0.0001 second of that time is actually server
| time, or 0.025 - plus you also have to consider if there are
| multiple cores working, or non-linear algorithms running, or
| who knows what else.
|
| Best way to figure it out is to use an application like Apache
| Bench from a powerful computer with a good internet connection,
| throw a lot of concurrent connections at the site, and see what
| happens.
| TekMol wrote:
| I think it makes sense to test from the server itself because
| otherwise I would test network infrastructure. While that is
| interesting too, I am trying to figure out what the server
| (VM) can handle first.
|
| I just tried Apache Bench:
|
|         ab -n 1000 -c 100 'https://www.mysite.com/'
|
|         Concurrency Level:    100
|         Time taken for tests: 1.447 seconds
|         Complete requests:    1000
|         Failed requests:      0
|         Requests per second:  691.19 [#/sec] (mean)
|         Time per request:     144.679 [ms] (mean)
|         Time per request:     1.447 [ms] (mean, across all
|                               concurrent requests)
|
| Wow, that is fast. Around 700 requests per second!
|
| Upping it 10x to 10k requests ...
|
|         Requests per second:  844.99 [#/sec] (mean)
|
| Even faster!
| lamnk wrote:
| A day is 16 * 60 * 60 = 57,600 seconds (night time
| subtracted). So tens of thousands of users per day is like 1-2
| req/s, maybe 50 at peak time.
|
| What is more important is what kind of requests your server has
| to serve. Nginx can easily serve 50-80k req/s of static
| content; in the 100k range if tuned properly.
___________________________________________________________________
(page generated 2021-09-07 23:01 UTC)