[HN Gopher] Searching the web for under $1000/month
___________________________________________________________________
Searching the web for under $1000/month
Author : francoismassot
Score : 394 points
Date : 2021-05-07 10:48 UTC (12 hours ago)
(HTM) web link (quickwit.io)
(TXT) w3m dump (quickwit.io)
| chris_f wrote:
| Nice! Maybe at one point you can release a general web search
| engine for the Common Crawl corpus? It seems even simpler than
| this proof of concept, but potentially more useful for people
| looking for a true full text web search.
|
| There isn't an easy way today to explore or search what is
| contained in the Common Crawl index.
| hansvm wrote:
| > There isn't an easy way today to explore or search what is
| contained in the Common Crawl index.
|
| By that you mean searching the full text contents of their
| crawl, right?
|
| The index is super easy to search nowadays -- in pretty much
| any language you can slap a few lines of code around a get
| request (using range requests [0] if needed), and explore a
| columnar representation of the index [1].
|
| [0]
| https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec1...
|
| [1] https://commoncrawl.org/2018/03/index-to-warc-files-and-
| urls...
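|
| For instance, a minimal Python sketch against the public CDX
| index server (the snapshot name here is just an example):
|
|   import requests
|
|   # One JSON record per line: WARC filename, byte offset and
|   # length of each capture of the URL pattern.
|   resp = requests.get(
|       "https://index.commoncrawl.org/CC-MAIN-2021-17-index",
|       params={"url": "example.com/*", "output": "json"},
|   )
|   for line in resp.text.splitlines()[:10]:
|       print(line)
|
| Each record's filename/offset/length is exactly what you need
| for a range request into the underlying WARC file.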
| fulmicoton wrote:
| That's on my to-do list for next week. :)
| capableweb wrote:
| > which is key as each instance issues a lot of parallel requests
| to Amazon S3 and tends to be bound by the network
|
| I wonder if most of the cost comes from S3, EC2, or the
| "premium" bandwidth that Amazon charges ridiculous prices for.
| Since it seems to be doing a lot of requests, it wouldn't
| surprise me if it's the network cost, and if so, I wonder why
| they would even use AWS at all.
| ddorian43 wrote:
| > I wonder if most of the cost comes from S3
|
| The current cost mostly comes from storing the big dataset in S3.
|
| > it wouldn't surprise me if it's the network cost
|
| Network cost applies only to outbound traffic. Inside AWS it's
| free (except multi-region, etc.). EC2 <-> S3 bandwidth is free
| (you pay per request).
| Grimm1 wrote:
| How are you dealing with the fact that Common Crawl updates its
| data much less regularly than commercial search engines? And
| that each update is only a partial refresh?
|
| Edit: And I will say your site design is very nice.
| francoismassot wrote:
| Thank you! We did not plan to regularly update the index. But
| as it takes only 24 hours to index 1B pages, the easiest way
| would be to reindex everything, upload it to S3 and update the
| metadata so the search engine will query the right segments.
| guilload wrote:
| We indexed Common Crawl only for the purpose of this demo, so
| this is a one-time thing; we won't deal with updates.
| Grimm1 wrote:
| Ah, I understand: you're showcasing the methodology for the
| underlying index, but you're going to open source the engine.
| I see -- great stuff then, super novel, and honestly the rest
| of the open-source search engines can definitely use some
| competition. Love it!
| sam_lowry_ wrote:
| Why use AWS if you are cost-conscious?
| francoismassot wrote:
| The main reason is that AWS S3 is widely used. We obviously
| want to make it work on HDFS, MinIO and other relevant storage
| systems.
| imhoguy wrote:
| Could this be adapted for IPFS? Anyone with a stateless client
| and a link to the index could search and become part of a swarm
| to speed up popular queries with redundancy.
|
| Then update it with git-like diff versioning, and use IPNS to
| point to the HEAD of the latest version of the index.
| heipei wrote:
| This looks really interesting, I wonder how they will monetize it
| though.
|
| As an aside, projects like these are what keep me wondering
| whether I should switch from cheaper but "dumb" object stores to
| AWS since on AWS you can use your object store together with
| things like Athena etc. and get pay-per-use search / grep and a
| lot of other things, without the egress fees since it's all
| within AWS.
| fulmicoton wrote:
| We really need to make this clear in our next blog post. This
| is not grep. We are using the same data structures that are
| used in Elasticsearch or Google.
|
| We just adapted them to be object-storage friendly. I would not
| call object storage dumb by any means. It is a very powerful
| bottom-up abstraction.
|
| We do manage to get SSD-like throughput from it. The latency is
| the big issue. We had to redesign our search to reduce the
| number of random reads in the critical path to the bare
| minimum.
| heipei wrote:
| Appreciate the response. I wasn't trying to say this is grep,
| I fully understand that this is an inverted index which is
| way more interesting to build on top of S3.
|
| I merely wanted to say that by using S3 within AWS you always
| have the fallback option of brute-force "grep" across your
| semi-structured "data lake" or whatever it's called thanks to
| the aggregate bandwidth and Athena.
| fulmicoton wrote:
| Ah my bad! Yes, Humio (and Loki) are opting for this
| approach.
|
| This does decouple compute and storage in a trivial manner.
| There is indeed a realm in which this brute force approach
| is the best approach.
|
| We could probably make a 4D chart with QPS, data size,
| latency, and retention period and define regions where the
| elastic/SOLR approach, Humio, and quickwit are the most
| relevant.
| busymom0 wrote:
| Is this reliant on S3, or can it be used on something like
| MinIO, DigitalOcean Spaces, or Backblaze B2 too? Backblaze-to-
| Cloudflare data transfer is free, so that can reduce costs a
| lot, plus B2 is much cheaper than S3.
| fulmicoton wrote:
| It can work on any object storage. I really want to test it on
| MinIO to see performance fly :)
| phendrenad2 wrote:
| Searching the web is a fool's errand. Google doesn't even search
| the web anymore, they just mind-controlled everyone to submit
| nightly sitemaps to them. Google is more of an index than a
| search engine nowadays.
| ywelsch wrote:
| Interesting! We've built similar support for decoupling compute
| from storage into Elasticsearch and, as coincidence would have
| it, just shared some performance numbers today:
|
| https://www.elastic.co/blog/querying-a-petabyte-of-cloud-sto...
|
| It works just as any regular Elasticsearch index (with full
| Kibana support etc.).
|
| The data being indexed by Lucene allows queries to access index
| structures and return results orders of magnitude faster than
| doing a full table scan.
|
| It is complemented with various caching layers to make repeat
| queries fast.
|
| We expect this new functionality to be used for less frequently
| queried data (e.g. operational or security investigations, legal
| discoveries, or historical performance comparisons on older
| data), trading query speed for cost.
|
| It supports Google Cloud Storage, Azure Blob Storage, Amazon S3
| (+ S3 compatible stores), HDFS, and shared file systems.
| johnghanks wrote:
| This is an ad.
| karterk wrote:
| Cool demo. Searching for phrases like "there was a" and "and
| there is" takes a really long time. I presume that since the
| words are common, the lists of document IDs mapped to those
| individual tokens are long as well, so intersections etc. take
| longer?
| francoismassot wrote:
| Thanks! You are totally right. For the demo, we have even
| banned a few words like "the" because the inverted list
| contains almost all doc ids...
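|
| In a nutshell: posting lists are sorted lists of doc ids, and
| a conjunction costs time proportional to their combined
| length. A toy Python version of the merge (tantivy's real
| implementation uses block encoding and skip data; this only
| shows the shape of the work):
|
|   def intersect(a, b):
|       # Merge-style intersection of two sorted posting lists,
|       # O(len(a) + len(b)): slow when both lists are huge.
|       out, i, j = [], 0, 0
|       while i < len(a) and j < len(b):
|           if a[i] == b[j]:
|               out.append(a[i]); i += 1; j += 1
|           elif a[i] < b[j]:
|               i += 1
|           else:
|               j += 1
|       return out
|
| For a stop word like "the", both lists cover nearly the whole
| corpus, so there is no shortcut to take.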
| hu3 wrote:
| Article title is "Searching the web for < $1000 / month".
|
| Despite mentioning Rust once, of course it had to be added to the
| title on HN as "Search 1B pages on AWS S3 for 1000$ / month, made
| in Rust and tantivy".
| snidane wrote:
| ChaosSearch seems to already be using this architecture, and
| according to the podcast episode [1], it uses a highly
| optimized storage layout.
|
| Never used it, so would be interested if somebody could comment
| on it.
|
| [1] https://www.dataengineeringpodcast.com/chaos-search-with-
| pet...
| marcinzm wrote:
| Interesting although a 15 second response time on certain queries
| is not a very good user experience.
| cj wrote:
| On the other hand, under 1.5 seconds on common / basic search
| terms is pretty good.
| fulmicoton wrote:
| The poster was referring to the latency of the demo and is
| absolutely correct. The demo can reach 30s on some queries.
| Half of it is due to fetching the 18k documents used for
| snippet generation, and half of it is single-threaded Python
| code that has nothing to do with our product :).
| fulmicoton wrote:
| This demo is indeed quite misleading.
|
| The high response time is due to the fact that we generate 18k
| snippets to build the tag cloud. That is the equivalent of
| clicking through pages 1 to 900 on Google!
|
| A "barack obama" phrase query generating 20 snippets runs in
| less than 2 seconds on our 2 cheap servers.
|
| I'll set up a "normal 20-result search" setting next week and
| share it as an API to show the latency again.
| rossmohax wrote:
| It is a cool project. S3 can be cost-efficient, but only if you
| don't touch the data :)
|
| Their price calculation doesn't mention the cost of S3
| requests, which adds up very quickly and is often neglected.
|
| It costs $1 per 2.5M GET requests to S3. They have 180 shards,
| and in the general case a query seems to fetch from all of
| them. Presumably they don't download a full shard per request,
| but an index + some relevant ranges. Let's say that is 10
| requests per shard. That would be 1,800 S3 GET requests per
| query, so ~1,400 search queries cost them $1.
|
| Assuming their service is reasonably popular and serves 1
| req/second on average, that would be ~$1,870 per 30 days on top
| of the advertised $1,000 spent on EC2 and S3 storage.
|
| Seems comparable to AWS ElasticSearch service costs:
|
| - 3 nodes m5.2xlarge.elasticsearch = $1,200
|
| - 20TB EBS storage = $1,638
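|
| To make the arithmetic reproducible (all inputs are the
| guesses above):
|
|   GET_PRICE = 0.0004 / 1000      # USD per GET request
|   GETS_PER_QUERY = 180 * 10      # 180 shards x ~10 ranged reads
|   MONTH = 30 * 24 * 3600         # seconds
|
|   cost = GETS_PER_QUERY * MONTH * GET_PRICE  # at 1 query/second
|   print(round(cost))             # ~1866 USD / month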
| fulmicoton wrote:
| I tend to agree :). If we get 1 req/s, even for a dataset of
| that size, this is not as cost efficient.
|
| For that kind of use case, I'd probably start using minio.
|
| > Seems comparable to AWS ElasticSearch service costs:
| > - 3 nodes m5.2xlarge.elasticsearch = $1,200
| > - 20TB EBS storage = $1,638
|
| Don't forget S3 includes replication. Also, EBS throughput
| (even with SSD) is not good at all. Also, our memory footprint
| is tiny, which is necessary to make it run on two servers.
|
| Finally, CPU-wise, our search engine is almost 2x faster than
| Lucene.
|
| If you don't believe us, try to replicate our demo on an
| Elasticsearch cluster :D.
|
| Chatnoir.eu is the only other Common Crawl cluster we know of.
| It consists of 120 nodes.
| rossmohax wrote:
| > If we get 1 req/s, even for a dataset of that size, this is
| not as cost efficient.
|
| How many req/s do you have in mind for your system to be a
| viable option?
|
| > Also EBS throughput (even with SSD) is not good at all.
|
| It is not worse than S3 still, right?
|
| > Chatnoir.eu is the only other common crawl cluster we know
| of. It consists of 120 nodes.
|
| I have no deep ES experience. Are you saying that to host 6TB
| of indexed data (before replication) you'd need a 120-node ES
| cluster? If so, then reducing it to just 2 nodes is the real
| sales pitch, not the S3 usage :)
| pcnix wrote:
| Have you checked out the new EBS gp3 volumes? Throughput vs
| cost is much better on those than on gp2, and they are also
| cheaper than provisioned IOPS.
| klohto wrote:
| What about d3en instances? Clustered, and together with MinIO,
| you might reach similar performance. Only issue is the cross-AZ
| traffic; it would need to stay inside the same AZ.
|
| EDIT: Realizing that d3 only has slow HDDs.
| fizx wrote:
| It's easy to put a block cache in front of the index, and I'm
| sure they'll get to it sooner or later.
|
| The benefit of using S3 in that case is that unlike e.g.
| Elastic, your block cache servers don't need replication, and
| you can tear them down when you're done. You can put them in a
| true autoscaling group as well.
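|
| A minimal sketch of such a cache (boto3 assumed; bucket, key
| and block size are made up for illustration):
|
|   from functools import lru_cache
|   import boto3
|
|   BLOCK = 1 << 20  # 1 MiB cache granularity
|   s3 = boto3.client("s3")
|
|   @lru_cache(maxsize=512)  # ~512 MiB of hot blocks per node
|   def read_block(bucket, key, block_no):
|       start = block_no * BLOCK
|       rng = f"bytes={start}-{start + BLOCK - 1}"
|       obj = s3.get_object(Bucket=bucket, Key=key, Range=rng)
|       return obj["Body"].read()
|
| Since S3 stays the source of truth, losing a cache node loses
| nothing; a fresh node just starts cold.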
| bufferoverflow wrote:
| AWS is almost never cost efficient. Maybe if you stay in their
| free tier.
| arcturus17 wrote:
| > AWS is almost never cost efficient.
|
| A ridiculous blanket statement, despite the "almost never"
| cop-out...
|
| It is cost-efficient in a wide array of scenarios. Many
| companies pay for it because they have calculated the different
| investment scenarios, and AWS comes out on top of alternatives
| such as owning the hardware or using competing cloud vendors.
|
| I own a consultancy that builds complex web apps, and while I
| appreciate how occasionally a dev has tried to save costs for
| me by cramming every piece of the stack (web server, cache, db,
| queue, etc.) into a single Docker image to host on a droplet,
| I'd much rather pay for separate services, as I consider it
| cheaper in the long run.
| bufferoverflow wrote:
| Name an AWS tier that there's no cheaper alternative for.
| I'm only aware of Glacier, I haven't seen anything cheaper
| than it.
|
| AWS is convenient and reliable, but it's not cheap.
| tjoff wrote:
| Many companies pay for it because they have spent an
| inordinate amount of time learning the ecosystem and know
| of nothing else.
| [deleted]
| heipei wrote:
| For what it's worth, if you want to run Elasticsearch on AWS I
| would always go with local-NVMe instances from the i3 family;
| this is also what AWS and Elastic themselves recommend.
|
| 4x i3en.2xlarge (64GB / 5TB NVMe each) at $449/month (1-yr
| reserved) is $1,796, or $2,636 without reservation, but with
| much better performance due to the NVMe drives.
| returningfory2 wrote:
| For Digital Ocean object storage, data transfer to/from a
| Digital Ocean VM is free. You only pay for bytes-at-rest.
|
| But it seems S3 doesn't have a similar offering. Data transfer
| is free between S3 and EC2 instances, but you still pay the
| per-request charge.
|
| I wonder whether you can factor this into the pricing
| calculation.
| rossmohax wrote:
| An obvious optimization would be to cache chunks locally on
| every worker node.
| tpetry wrote:
| I had the same feeling when reading the post. Their remark that
| they "estimated the cost" to be that low is, in my experience,
| a bad signal. Estimating costs in the cloud is really hard;
| there are so many (hidden) costs you may miss, making it a lot
| more expensive.
| ykevinator3 wrote:
| What an amazing project, good luck to you guys and thanks for
| sharing.
| fulmicoton wrote:
| Thank you @ykevinator!
| simonw wrote:
| What does your on-S3 storage format look like? Are you storing
| relatively large blobs and doing HTTP Range requests against them
| or are you storing lots of tiny objects and fetching the whole
| object any time you need it?
| guilload wrote:
| What we store on S3 is a regular tantivy index plus another
| tiny data structure that we call the "turbo index", which makes
| queries faster on object storage. For this demo, the tantivy
| indexes are fairly large, and we issue HTTP Range requests
| against them.
|
| https://github.com/tantivy-search/tantivy
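|
| Concretely, a range read against a large index file is just an
| HTTP GET with a Range header (URL and offsets below are
| hypothetical):
|
|   import requests
|
|   r = requests.get(
|       "https://my-bucket.s3.amazonaws.com/idx/segment.idx",
|       headers={"Range": "bytes=1048576-2097151"},  # 2nd MiB
|   )
|   assert r.status_code == 206  # Partial Content
|   chunk = r.content            # exactly the bytes asked for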
| not2b wrote:
| But are you solving the right problem? This sounds like someone
| has produced a very good and efficient version of AltaVista. Back
| in the 1990s, if you wanted to do classic keyword searches of the
| web, and find all pages that had terms A and B but not C, it
| would give them to you, in a big unsorted pile. The web was still
| small enough that this was sometimes useful, but until Google
| came along with tricks to rank pages that are obvious in
| retrospect, it just wasn't useful for common search terms.
| cardosof wrote:
| Congrats on the project and very cool demo!
|
| One point that may help -- I searched the word "fast" with
| "adjective" selected and it didn't show results.
| francoismassot wrote:
| Thanks! I guess you were unlucky and the server did not
| respond; we see a bunch of errors on the Python server and it
| may come from there.
|
| It's working now. You can try it and find that the top result
| is "fast and easy": https://common-
| crawl.quickwit.io/?query=fast&partOfSpeech=AD...
| ClumsyPilot wrote:
| Seems like you could build a workstation that runs these
| queries faster and cheaper than AWS ever could, on a RAIDed set
| of NVMe drives.
|
| https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-th...
| ProKevinY wrote:
| Brilliant and interesting project by smart people. Kudos. (the
| demo is addictive af)
| ryanworl wrote:
| What are you using for metadata storage?
| fulmicoton wrote:
| There are only 180 splits, so for this demo we use a file.
|
| For more serious deployments we use PostgreSQL.
| ryanworl wrote:
| What does the metadata structure look like?
| guilload wrote:
| We store the URI of each shard making up the index and,
| optionally, a partition key and value(s). Along with a few
| flags, we also store the shard size and the creation and
| last-modification times. This additional metadata is not
| required for the query-planning phase; it is only useful for
| managing the life cycle of the shards and for
| debugging/troubleshooting.
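|
| In Python terms, a metastore entry looks roughly like this
| (field names are illustrative, not the actual schema):
|
|   from dataclasses import dataclass
|   from typing import Optional
|
|   @dataclass
|   class ShardMeta:
|       uri: str              # e.g. "s3://bucket/shard-00042"
|       size_bytes: int
|       created_at: float     # unix timestamps
|       updated_at: float
|       partition_key: Optional[str] = None
|       partition_values: Optional[list] = None
|
| Query planning only needs the URIs (and partition values, when
| pruning); the rest is bookkeeping.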
| artembugara wrote:
| Francois, Adrien, that's a super nice demo.
|
| Stateless search engine is something new, for sure.
|
| I'd be super interested to see how it evolves over time. We're
| [1] indexing over 1,000,000 news articles per day. We're using
| ElasticSearch to index our data.
|
| Would be interested to see if there's a way to make a cross-demo?
| Let me know.
|
| [1] https://newscatcherapi.com/
| fulmicoton wrote:
| That sounds interesting indeed.
|
| Can you schedule a meeting with me? https://calendly.com/paul-
| quickwit/30min
| artembugara wrote:
| Merci
| [deleted]
| natpat wrote:
| This is super interesting. I've recently also been working on a
| similar concept: we have a reasonable amount of data (in the
| terabytes) that's fairly static, and that I need to search
| fairly infrequently (but sometimes in bulk). The solution we
| came up with is a small, hot, in-memory index that points to
| the location of the data in a file on S3. Random access to a
| file on S3 is pretty fast, and running on an EC2 instance means
| latency to S3 is almost nil. Cheap, fast and effective.
|
| We're using some custom Python code to build a MARISA trie as
| our index. I was wondering if there are alternatives to this
| setup?
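|
| For reference, the trie maps each key to an (offset, length)
| pair into the big file on S3 -- something like this
| (illustrative keys and record format):
|
|   import marisa_trie
|
|   # 8-byte offset + 4-byte length packed per key
|   trie = marisa_trie.RecordTrie("<QI", [
|       ("doc:0001", (0, 4096)),
|       ("doc:0002", (4096, 1024)),
|   ])
|   offset, length = trie["doc:0001"][0]
|   # ...then fetch bytes [offset, offset + length) from S3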
| fulmicoton wrote:
| There might be a much better alternative, but it really depends
| on the nature of your keys.
|
| Because the crux of S3 is latency, you can also decide to
| encode the docs in blocks and retrieve more data than is
| actually needed.
|
| For this demo, the index from DocID to offset in S3 takes 1.2
| bytes per doc. For a log corpus, we end up with 0.2 bytes per
| doc.
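|
| A sketch of the block idea (block size, record separator and
| the read_range helper are all placeholders):
|
|   import zlib
|
|   BLOCK_DOCS = 64  # pack 64 docs per compressed block
|
|   def fetch_doc(doc_id, block_offsets, read_range):
|       # block_offsets[i] = start byte of block i, plus a final
|       # sentinel; read_range(start, end) does one S3 request.
|       b = doc_id // BLOCK_DOCS
|       raw = read_range(block_offsets[b],
|                        block_offsets[b + 1] - 1)
|       docs = zlib.decompress(raw).split(b"\x1e")
|       return docs[doc_id % BLOCK_DOCS]
|
| One round trip then serves up to 64 neighboring docs -- the
| right trade when latency, not bandwidth, is the bottleneck.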
| looklikean wrote:
| Combining data-at-rest with a slim index structure and a common
| access method (like HTTP) was the idea behind a key-value store
| for JSON I once wrote:
| https://github.com/miku/microblob
|
| I first thought of building a custom index structure, but found
| that I did not need everything in memory all the time. Using an
| embedded LevelDB works just fine.
| heipei wrote:
| You could look at AWS Athena, especially if you only query
| infrequently and can wait a minute for the search results.
| There are some data layout patterns in your S3 bucket that you
| can use to optimize the search. Then you have true pay-per-use
| querying and don't even have to run any EC2 nodes or code
| yourself.
| gbrits wrote:
| Also check out Dremio with parquet files stored on S3
| thejosh wrote:
| You might want to check out Snowflake for something like this;
| it makes searching pretty easy, especially as it seems your
| data is semi-static. We use it pretty extensively at work and
| it's great.
|
| For your use case it'll be very cheap if you don't access it
| constantly (you can probably get away with the extra-small
| instances, which are billed per minute).
|
| Not affiliated in any way, just a suggestion.
| giovannibonetti wrote:
| This is the kind of thing I value in Rails. Active storage [1]
| has been around for a few years and it solves all of this. All
| the metadata you care about is in the database - content type,
| file size, image dimensions, creation date, storage path.
|
| [1] https://guides.rubyonrails.org/active_storage_overview.html
| ddorian43 wrote:
| > that I need to search fairly infrequently (but sometimes in
| bulk).
|
| What do you mean by search? Full-text search? Do you need to
| run custom code on the original data?
|
| > A solution we came up with was a small , hot, in memory
| index, that points to the location of the data in a file on S3.
|
| Yes, it's like keeping the block index of an SSTable (in
| RocksDB) in memory. The next step is to have a local cache on
| the EC2 node. And the step after that is a "distributed" cache
| across your EC2 nodes, so you don't query S3 for a chunk if
| it's present on any of your other nodes.
|
| Come to think of it, I searched and didn't find a "distributed
| disk cache with optional replication" that can be used in front
| of S3 or whatever dataset. You can use nginx/varnish as a
| reverse-proxy but it doesn't have "distributed". There is
| Alluxio, but it's single-master.
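|
| The routing half of such a cache is small, though; a
| consistent-hash ring along these lines decides which node owns
| a chunk:
|
|   import bisect, hashlib
|
|   class Ring:
|       def __init__(self, nodes, replicas=64):
|           self._keys, self._owner = [], {}
|           for n in nodes:
|               for i in range(replicas):
|                   d = hashlib.md5(f"{n}:{i}".encode())
|                   h = int(d.hexdigest(), 16)
|                   self._keys.append(h)
|                   self._owner[h] = n
|           self._keys.sort()
|
|       def node_for(self, chunk_key):
|           d = hashlib.md5(chunk_key.encode())
|           h = int(d.hexdigest(), 16)
|           i = bisect.bisect(self._keys, h) % len(self._keys)
|           return self._owner[self._keys[i]]
|
| Every node can then answer "who should cache block 17 of this
| object?" locally; the hard parts are eviction and replication.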
| natpat wrote:
| > What do you mean by search?
|
| Search maybe is too strong a word - "lookup" is probably more
| correct. I have a couple of identifiers for each document,
| from which I want to retrieve the full doc.
|
| I'm not sure what you mean by running custom code on the
| data. I usually do some kind of transformation afterwards.
|
| I didn't find anything either, which is why I was wondering
| if I was searching for the wrong thing.
| ddorian43 wrote:
| How big is each document? If documents are big, keep each of
| them as a separate file and store the IDs in a database. If
| documents are small, then you want something like
| https://github.com/rockset/rocksdb-cloud as a building block.
| hungnv wrote:
| > Come to think of it, I searched and didn't find a
| "distributed disk cache with optional replication" that can
| be used in front of S3 or whatever dataset. You can use
| nginx/varnish as a reverse-proxy but it doesn't have
| "distributed". There is Alluxio, but it's single-master.
|
| If you think more about this, it will be like a distributed
| key-value store with support for both disk and memory access.
| You could write one using some open-source Raft libraries, or
| a possible candidate is TiKV from PingCAP.
| ddorian43 wrote:
| > If you think more about this, it will be like a distributed
| > key-value store with support for both disk and memory
| > access. You could write one using some open-source Raft
| > libraries, or a possible candidate is TiKV from PingCAP.
|
| My whole point was not building it ;)
|
| There's also https://github.com/NVIDIA/aistore
| jonatron wrote:
| If you're going for low cost, you could do better:
|
| https://www.hetzner.com/dedicated-rootserver/dell/dx181/conf...
|
| Basic configuration in Finland (x1): 224.91 EUR
|
| 1.92 TB SATA SSD Datacenter Edition (x4): 95.20 EUR
|
| Total: 320.11 EUR/month
|
| 320 EUR is about 385.90 USD.
| gallexme wrote:
| Depending on the requirements,
| https://www.hetzner.com/dedicated-rootserver/ax101/ may
| actually be a better fit, once they are available again.
| blobster wrote:
| The value for money that Hetzner offers is just mind-boggling.
| ddorian43 wrote:
| You'll be waiting 1+ month to get the server above.
| jiofih wrote:
| Holy smokes. 8TB SSD + 128GB RAM + Ryzen 9 for 100 euro a
| month.
|
| Can you get anywhere close to this with AWS or even DO?
| fulmicoton wrote:
| That's amazing pricing O_O (drooling)
| heipei wrote:
| The sad story is that you can't get anywhere close to this even
| with other rented dedicated servers. As a German I'm happy that
| we have Hetzner, and I use their services extensively. However,
| if I wanted to start deploying things in the US or Asia, I'd be
| forced to go with something like OVH, which, while still a lot
| cheaper than AWS, is significantly more expensive than Hetzner.
| wongarsu wrote:
| On AWS you can't get 128GB RAM on anything for less than
| $300/month (or nearly $500 on-demand). And to get multiple TB
| of SSD you need significantly larger instances, north of
| $1000/month.
|
| Similar with DO: the closest equivalent is a 3.52TB SSD, 128GB
| RAM, 16 vCPU droplet for $1240/month.
|
| If you need raw power instead of integration into an
| extensive service ecosystem, dedicated servers are hard to
| beat (short of colocating your own hardware, which comes
| with more headache). And Hetzner is among the best in terms
| of value/money.
| kuschku wrote:
| Of course not. But that's why the "cloud" (as in the typical
| DO/AWS/Azure/GCP offerings) is a scam.
| heipei wrote:
| Huge fan of Hetzner, but dedicated servers do not
| invalidate the value proposition of the cloud.
|
| Ordering a server at Hetzner can take anywhere between a few
| minutes and a few days. Each server has a fixed setup cost of
| around one month's rent. They only have two datacenters in
| Europe. They don't have any auxiliary services (databases,
| queues, scalable object storage, etc.). They are unbeatable for
| certain use cases, but the cloud is still valuable for lots of
| other scenarios.
| ryanlol wrote:
| > They only have two datacenters in Europe
|
| Nonsense, Hetzner operates like 25 datacenters.
| heipei wrote:
| Sorry, let's call it "regions" then; they have multiple DCs in
| different cities in Germany, but for latency purposes I would
| consider these part of one region.
| marcinzm wrote:
| Just because you don't understand the value proposition
| of something doesn't make it a scam.
| Retric wrote:
| AWS is a scam not because it can't save you money, but
| because they actively try to trick you into spending more
| money. That's practically the definition of a scam.
|
| Go to the AWS console and try to answer even simple things
| like: how much did the last hour/day/week cost me? Or how about
| a notification when that new service you just added is going to
| cost vastly more than you were expecting?
|
| I know of a few people who got fired after migrating to AWS,
| and it's not because the company was suddenly saving money.
| marcinzm wrote:
| AWS is pretty bad at telling you how much something you're not
| yet running will cost, but I've never had any issue knowing
| what something has cost me in the past.
|
| >Go to the AWS console and try to answer even simply
| things like how much did the last hour/day/week cost me?
|
| Click user@account in top right, click My Billing
| Dashboard, spend this month is on that page in giant
| font, click Cost Explorer for more granular breakdown
| (day, service, etc.), click Bill Details for list
| breakdown of spend by month.
|
| >Or how about some notifications if that new service you
| just added is going to cost vastly more than you where
| expecting.
|
| Billing Dashboard and then Budgets.
|
| edit: This assumes you have permissions to see billing
| details, by default non-root accounts do not which might
| be why you're confused.
| Retric wrote:
| > Click user@account in top right, click My Billing
| Dashboard, spend this month is on that page in giant
| font, click Cost Explorer for more granular breakdown
| (day, service, etc.), click Bill Details for list
| breakdown of spend by month.
|
| Sure, you see a number, but I was just talking with someone at
| AWS who said you still can't trust it to be up to date,
| especially across zone boundaries. That means it's useful when
| everything is working as expected, but it can be actively
| misleading when troubleshooting.
| whoknew1122 wrote:
| Disclosure: Work at AWS.
|
| I've never seen AWS actively try to trick people into
| spending more money. I've seen Premium Support, product
| service teams, solutions architects, and account managers
| all suggest not to use AWS services if they don't fit the
| customer's use case. I've personally recommended non-AWS
| options for customers who are trying to fit a square peg
| into a round hole.
|
| Can the billing console be better? Yes. But AWS isn't trying to
| trick anyone into anything. The console, while it has its
| troubles, doesn't have dark patterns, and pricing is
| transparent. You pay for what you use, and prices have never
| increased.
|
| Hell, I know of a specific service that was priced poorly
| (meaning it wasn't profitable for AWS). Instead of raising
| prices, AWS ate the cost while rewriting the entire service
| from scratch to give it better offerings and make it cheaper
| (both for AWS and customers).
| Retric wrote:
| I haven't used AWS in a while, but one trick I recall was that
| enabling service X also enabled sub-dependencies. Disabling
| service X didn't instantly stop services Y and Z, which you
| continued to be billed for. Granted, not that expensive, but it
| still felt like a trap.
|
| Other stuff was more debatable, but it just felt like dancing
| in a minefield.
| wongarsu wrote:
| > pricing is transparent
|
| If pricing is intended to be transparent, then why is it
| completely absent from the user interface? Transparent pricing
| would be telling me how much something costs when I order it,
| not making me use a different tool or dig it out of the
| documentation.
| marcinzm wrote:
| No, no, you're supposed to use their cthulhu inspired
| pricing tool. I mean, you've got at least a 50/50 chance
| of figuring out how to use it before you go permanently
| insane.
| rossmohax wrote:
| Another example of a somewhat dark pattern is listing
| ridiculously small prices ($0.0000166667 per GB-second, $0.0004
| per 1000 GET requests). It's hard to reason about very small
| and very big numbers; order-of-magnitude differences "feel" the
| same. Showing such small prices is accurate, but deceiving
| IMHO.
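|
| Multiplying them out to a month is the only way to get a feel
| for them:
|
|   lambda_gb_s = 0.0000166667   # USD per GB-second
|   s3_get = 0.0004 / 1000       # USD per GET request
|   month = 30 * 24 * 3600       # seconds
|
|   print(lambda_gb_s * month)   # 1 GB busy all month: ~43 USD
|   print(s3_get * 100 * month)  # 100 GETs/s all month: ~104 USD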
| rossmohax wrote:
| I do not support the view that AWS is a scam, but price is
| something AWS tries to make developers not think about. Every
| blog post, documentation page, or quick start tells you about
| features, but never about costs.
|
| You read "you can run Lambda in a VPC" -- great, but there is
| fine print somewhere on a remote page: you'd also need a NAT
| gateway if you want said Lambda to access the internet, a
| public subnet won't do.
|
| You read "you can enable SSE on S3", but it is not immediately
| obvious that every request then incurs a KMS call and is billed
| accordingly (that was before the bucket key feature).
|
| Want to enable Control Tower? It creates so many services that
| it is impossible to predict costs until you enable it and wait
| to be billed.
| michaelmrose wrote:
| In order for a system to be effective at achieving a goal, its
| owners and operators don't have to sit around a table in a
| smoke-filled room and toast to evil. The goal, good, bad or
| indifferent, merely has to be progressively incentivized by
| prevailing conditions.
|
| If clarity causes customers to spend less, it is
| disincentivized, and since clarity is hard and requires active
| investment to maintain, it decays naturally.
|
| It's easy to see how you can end up with a system that users
| experience as a dishonest attempt to get more of their money,
| and that operators, who are necessarily very familiar with the
| system, experience as merely messy but transparent.
|
| Neither is precisely wrong; however, your users don't have your
| experience or training, and many are liable to interact with a
| computer, not with you. Your system is then exactly as honest
| and transparent as your UI as perceived by your average user.
| potiuper wrote:
| Why a Ryzen instead of an Epyc in a data center?
| hansel_der wrote:
| Because it's a cheap hoster; they use a lot of desktop CPUs.
|
| AFAIK their most popular product is the EX4x line with an
| i7-6700.
| gallexme wrote:
| Also because the 5950X is likely faster than a Zen 2 Epyc for
| many workloads that do not scale linearly across more cores
| (Zen 3 brought huge single-thread performance improvements).
| onebot wrote:
| I am really starting to feel that co-location will make a big
| comeback. It seems cloud costs are just becoming too high for
| the convenience they once offered. For small projects and scale
| probably makes a ton of sense, but at some point the costs to
| scale aren't worth the up front developer cost savings.
| fulmicoton wrote:
| It depends on the use case, doesn't it?
|
| Shared-nothing is the best architecture for e-commerce search,
| for instance.
|
| But if you have one query every minute or so on a 1TB dataset,
| it feels a bit silly to have a couple of servers dedicated to
| it, doesn't it? Imagine this is the case for all the big-data
| search you can think of... logs, emails, etc. This is a waste
| of CPU and RAM.
| toast0 wrote:
| Bare-metal hosting is a happy medium between colo and cloud.
| You don't have much control over the network, so it might not
| be enough if you need faster NICs than they offer, but if you
| fit in their offerings, it can work well.
|
| OTOH, the bare-metal hoster I worked with is now owned by IBM,
| and a big competitor is owned by private equity; bare metal
| from cloud providers still has a lot of cloudiness associated
| with it too. Maybe colo is the way to go.
| mwcampbell wrote:
| How about OVH? They now have data centers in Canada and the
| US as well as Europe.
| curryst wrote:
| Where they get you is that it very rarely makes financial sense
| to do both cloud and colo/on-prem (unless you're a massive
| company). It ends up being way too expensive to use the cloud
| while also hiring engineers to build an on-prem cloud. Most
| companies have a mixed bag of projects that are either better
| served by the cloud, or are okay with colo and the savings it
| can bring.
|
| Assuming you don't want to do a hybrid approach, then you
| either push everyone onto the cloud and accept paying more,
| or you push everyone into colo and force the small and
| scaling out projects to deal with stuff like having to order
| hardware 3 months in advance.
|
| Then, depending on how nice you want it to be to interact
| with your infrastructure, you can end up paying a lot to have
| people build abstractions over it. Do you want developers to
| be able to create their own database from a merge request or
| API call? If so, now you're going to have to hire someone
| with a 6 figure salary to figure out how to do that. It's
| easy to forget how many things are involved in that. You're
| going to have a lot of databases, so you need a system to
| track them. A lot of these databases are presumably not big
| enough to warrant a full physical server, so you have to sort
| out multi-tenancy. If you have multi-tenancy, you need a way
| to handle RBAC so one user can't bork all the databases on
| the host. You will also need some way to handle what happens
| when one user is throwing so much load at the RDBMS it's
| impacting other apps on that database. To accomplish that,
| you're going to need a way to gather metrics that are sharded
| per-database and a way to monitor those (which is admittedly
| one of the easier bits). You also generally just straight up
| lose a lot of the scaling features. I don't have a way to
| just give you more IOPS to your database on-prem. The best I
| can do is add more disks, but your database will be down for
| a long time if I have to put a disk in, expand the RAID, let
| it redistribute data and then power it back up. That's
| several hours of downtime for you, along with anyone who's on
| the same database. Of course, we can do replicas, and swap
| the master, but everyone will have to reconfigure their apps
| or we need something like Consul to handle that (which means
| more engineers to manage that stuff).
|
| You're also probably going to need more than one of those
| expensive infra people, because they presumably need an on-call
| rotation, and no one is going to agree to be on-call all the
| time. And every time someone quits, you have to train the new
| person, which is several months of salary basically wasted.
|
| That's not to say that you don't need infra people on AWS, but
| you a) need a lot fewer of them, because they only need to
| manage the systems AWS has, not build them, and b) can hire
| cheaper ops people, again because you don't need people capable
| of building those kinds of systems.
|
| Once you factor in all of that stuff, AWS' prices start
| looking more reasonable. They're still a little higher, but
| they're not double the price. If anything more than a tiny,
| tiny subset of the AWS features are appealing, it's going to
| cost you almost as much to build your own as it does to just
| pay Amazon/Google/Microsoft/whoever.
|
| Also, a massive thing people overlook is that AWS is fairly
| well documented. I can Google exactly how to set up
| permissions on an S3 bucket, or how to use an S3 bucket as a
| website. It only takes seconds, the cognitive burden is low,
| and the low-friction doesn't cause anyone stress. In-house
| systems tend to be poorly documented, and doing anything
| slightly outside the norm becomes a "set up a meeting with
| the infra team" kind of thing. It takes forever, but more
| importantly, it takes a lot of thought and it's frustrating.
| ElFitz wrote:
| > Also, a massive thing people overlook is that AWS is
| fairly well documented. I can Google exactly how to set up
| permissions on an S3 bucket, or how to use an S3 bucket as
| a website.
|
| > In-house systems tend to be poorly documented, and doing
| anything slightly outside the norm becomes a "set up a
| meeting with the infra team" kind of thing.
|
| I usually wasn't really happy with AWS' documentation. But now,
| considering the alternative, I find it quite lovely. Thank you
| for making me realize that.
| rossmohax wrote:
| You save on specialized engineers (database, RabbitMQ, Ceph
| administrators), but you lose elsewhere.
|
| What used to be an Apache server serving static files is now an
| S3 bucket, but it won't be easy, because you wanted your own
| domain, so now you need CloudFront for SSL support. Their
| tutorial conveniently mentions this only at step 7 ("Test your
| website endpoint").
|
| You buy into Cognito -- great, you saved money on a Keycloak
| administrator -- but at the worst moment, deep into the
| project, you learn that there is absolutely no way to support
| multiple regions, even if you are willing to do some legwork
| for AWS. Or you find that the Cognito email reset flow can't go
| through your existing customer contact system and must go
| through SES only, and suddenly you find yourself developing an
| elaborate log/event processing tool just so that your customer
| service agents can see password reset events in their
| interface.
|
| GCP Cloud SQL, managed RDBMS, great! No upgrade path for you
| other than SQL dump/restore of your 10TB instance; have fun.
|
| Cloud might be a net win still, but it is very much not as rosy
| as cloud evangelists want us to think.
| [deleted]
| fulmicoton wrote:
| This is a complex misunderstanding...
|
| First, we are getting better throughput from S3 than if we were
| using a SATA SSD (and slower than an NVMe SSD). This is a bit
| of a secret.
|
| Of course, single-stream sequential throughput on S3 sucks. At
| the end of the day the data is stored on spinning disks and we
| cannot do anything against the laws of physics.
|
| ... but we can concurrently read many disks using S3. Network
| is our only bottleneck. The theoretical upper bound on our
| instances is 2GB/s. On throughput-intensive 1s queries, we
| observe an average of 1GB/s.
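|
| The trick is simply many ranged GETs in flight at once. A toy
| illustration of the idea with Python threads (our actual code
| is Rust; bucket and key names are made up):
|
|   from concurrent.futures import ThreadPoolExecutor
|   import boto3
|
|   s3 = boto3.client("s3")
|   SLICE = 8_000_000  # 8 MB per request
|
|   def read(rng):
|       start, end = rng
|       obj = s3.get_object(Bucket="my-index", Key="seg.idx",
|                           Range=f"bytes={start}-{end}")
|       return obj["Body"].read()
|
|   # 64 concurrent reads hit many backing disks at once, so the
|   # NIC, not any single request, becomes the bottleneck.
|   ranges = [(i * SLICE, (i + 1) * SLICE - 1) for i in range(64)]
|   with ThreadPoolExecutor(max_workers=64) as ex:
|       parts = list(ex.map(read, ranges))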
|
| Also, you are not accounting for replication. S3 costs include
| battle-tested, multi-DC replication.
|
| Last but not least, S3 trivially decouples compute and storage.
| It means that we can host 100 different indices on S3 and use
| the same pool of search servers to deal with the CPU-bound
| stuff.
|
| This last bit is really what drives the price an extra 5x down
| for many use cases.
| plater wrote:
| "S3 costs include battle tested, multi-DC replication."
|
| Sometimes we pay a bit too much for this multi-replication,
| battle-tested stuff. It's not like the probability of losing
| data is THAT huge. For the 4x extra cost you could easily take
| a backup every 24h.
|
| "It means that we can host 100 different indices on S3, and
| use the same pool of search server to deal with the CPU-bound
| stuff"
|
| You can do that with NFS.
|
| It's amazing how much we are willing to pay for a bunch of
| computers in the cloud. Leasing a new car costs around
| $350/month. You could have three new cars at your disposal
| for the same price as this search implementation.
| curryst wrote:
| > For the 4x extra cost you could easily take a backup
| every 24h.
|
| It's also worth considering the cost to simply regenerate
| the data for something like this that isn't the source of
| truth. You'll lose any content that you indexed that has
| disappeared from the web, but that seems like a feature
| more than a bug.
|
| > You can do that with NFS.
|
| You're going to be bound by your NIC speed. You can bond
| them together, but the upper bounds on NFS performance are
| going to be significantly lower than on S3. Whether that's
| going to be an issue for them or not, I don't know, but a
| big part of the reason for separating compute and storage
| is so that one of them can scale massively without the
| other.
| layla5alive wrote:
| 100Gbps NICs are cheap, relative to the price of the
| cloud...
| 2Gkashmiri wrote:
| How about setting up MinIO on these Hetzner setups? You get the
| benefits of S3 on cheap hardware without the AWS costs.
| fulmicoton wrote:
| Absolutely! I want to try that. We are especially interested in
| testing the latency MinIO could offer.
| rossmohax wrote:
| I've heard MinIO's metadata handling isn't great; it queries
| all servers. SeaweedFS might give you better results.
| f430 wrote:
| do you know if they let you host adult videos?
| gallexme wrote:
| Yeah, sure, if you have all the rights to host those videos; in
| the whole of Europe/Germany it's allowed:
|
| https://www.hetzner.com/rechtliches/cloud-server/?country=de
|
| I know a couple of people who have naughty stuff on it (like
| sex toy shops, sexual services, private for-sale adult videos).
| f430 wrote:
| Hmmm, why Hetzner though? 100TB offers massive amounts of
| bandwidth for tube sites, I think.
|
| Do you know if Amazon S3 allows adult content?
| visarga wrote:
| Is it a web search engine or an adjective search engine? I'd
| love to see someone make a deep search engine that goes beyond
| the 100...1000 result limit.
| fulmicoton wrote:
| It is a web search engine. As explained in the blog post, we
| made the demo by generating 18k snippets and pushing them
| through an NLP pipeline that tries to extract adjective
| phrases.
|
| The tech underneath is an inverted index.
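|
| For the curious, the extraction step can be done with an
| off-the-shelf tagger; a spaCy sketch (spaCy is an assumption
| here, the post doesn't name the library):
|
|   import spacy
|
|   # Requires: python -m spacy download en_core_web_sm
|   nlp = spacy.load("en_core_web_sm")
|   doc = nlp("Barack Obama was an eloquent president.")
|   adjectives = [t.text for t in doc if t.pos_ == "ADJ"]
|   print(adjectives)  # ['eloquent']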
| chrisacky wrote:
| Is there a more recent Common Crawl dataset? 2019 was a long
| time ago.
|
| Reason I ask is that I'm trying to get all subdomains of a
| certain domain. So I want a reverse-host listing of unique
| hostnames under a certain domain.
| guilload wrote:
| There are more recent versions of the dataset. We used the
| February/March snapshot from this year, and the April snapshot
| just came out
| (https://commoncrawl.org/2021/04/april-2021-crawl-archive-
| now...).
| djdjdjdjdj wrote:
| Huh, I wonder why this is not a cost trap. S3 API requests are
| relatively expensive.
| fulmicoton wrote:
| The bandwidth is free if you are in the same region.
| djdjdjdjdj wrote:
| But you pay for requests.
| bambax wrote:
| Very interesting! For some reason I find search engines
| fascinating...
|
| How dependent is this on AWS? Can it be ported to another cloud
| provider?
| fulmicoton wrote:
| We have a storage abstraction that boils down to being able to
| perform range queries.
|
| Anything that allows us to do range queries is OK.
|
| That includes basically every object storage I know of (Google
| Cloud Storage, Azure Storage, MinIO, you name it), but also
| HDFS, or even a remote HTTP/2 server.
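|
| In spirit (Quickwit itself is written in Rust, so this Python
| sketch only shows the shape of the abstraction):
|
|   from abc import ABC, abstractmethod
|
|   class Storage(ABC):
|       @abstractmethod
|       def get_range(self, path: str, start: int, end: int) -> bytes:
|           ...
|
|   class LocalStorage(Storage):
|       def get_range(self, path, start, end):
|           with open(path, "rb") as f:
|               f.seek(start)
|               return f.read(end - start + 1)
|
|   class S3Storage(Storage):
|       def __init__(self, bucket):
|           import boto3
|           self.bucket = bucket
|           self.s3 = boto3.client("s3")
|
|       def get_range(self, path, start, end):
|           obj = self.s3.get_object(
|               Bucket=self.bucket, Key=path,
|               Range=f"bytes={start}-{end}")
|           return obj["Body"].read()
|
| Porting to another backend means implementing one method.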
| 0xbadcafebee wrote:
| > a new breed of full-text search engine
|
| The following is a stupid question, so bear with me.
|
| I have been using search engines for about... 26 years. I have
| attempted to make really crappy databases and search engines. I
| have worked for companies that use search products for internal
| services and customer products. I'm not a _search engineer_ but I
| have a decent understanding of them and their issues, I think.
| And I get why people _want_ full-text search. But is it actually
| a good idea? Should anyone really be using full text search?
|
| I actually work on search products right now. We use Solr as the
| general full text index. We have separate indexes and algorithms
| to make context and semantic inferences, and prioritize results
| based on those, falling back to full text if we don't get
| anything. The full text sucks. The corpus of relationships of
| related concepts is what makes the whole thing useful.
|
| Are we (all) only using full-text search because some users
| demand that it be there? Or shouldn't we all stop this charade
| of thinking that full-text search of billions of items of data
| will ever be useful to a human being? Even when I show my
| coworkers that I can get something done 10x faster with a
| curated index of content, they _still_ want a search engine
| that they know doesn't give them the results they want.
|
| Is full-text search the junk food of information retrieval?
___________________________________________________________________
(page generated 2021-05-07 23:00 UTC)