[HN Gopher] Searching the web for under $1000/month
       ___________________________________________________________________
        
       Searching the web for under $1000/month
        
       Author : francoismassot
       Score  : 394 points
       Date   : 2021-05-07 10:48 UTC (12 hours ago)
        
 (HTM) web link (quickwit.io)
 (TXT) w3m dump (quickwit.io)
        
       | chris_f wrote:
        | Nice! Maybe at some point you can release a general web search
        | engine for the Common Crawl corpus? It seems even simpler than
        | this proof of concept, but potentially more useful for people
        | looking for true full-text web search.
       | 
       | There isn't an easy way today to explore or search what is
       | contained in the Common Crawl index.
        
         | hansvm wrote:
         | > There isn't an easy way today to explore or search what is
         | contained in the Common Crawl index.
         | 
         | By that you mean searching the full text contents of their
         | crawl, right?
         | 
          | The index is super easy to search nowadays -- in pretty much
          | any language you can slap a few lines of code around a GET
          | request (using range requests [0] if needed), and explore a
          | columnar representation of the index [1].
         | 
         | [0]
         | https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec1...
         | 
         | [1] https://commoncrawl.org/2018/03/index-to-warc-files-and-
         | urls...
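         
          As a concrete illustration of "a few lines of code around a GET
          request": a minimal sketch (in Python) that looks up a URL in the
          public Common Crawl CDX index and then range-reads the matching
          WARC record. The crawl label and exact field handling are
          assumptions about the public API, not something from the article.
         
              import json
              import requests
         
              # Ask the public Common Crawl index server for captures of a URL.
              # The crawl label (CC-MAIN-2021-17) is just an example.
              INDEX = "https://index.commoncrawl.org/CC-MAIN-2021-17-index"
              resp = requests.get(INDEX, params={"url": "example.com/", "output": "json"})
              resp.raise_for_status()
              record = json.loads(resp.text.splitlines()[0])
         
              # Each record points into a WARC file: filename, byte offset and length.
              warc_url = "https://commoncrawl.s3.amazonaws.com/" + record["filename"]
              start = int(record["offset"])
              end = start + int(record["length"]) - 1
         
              # Fetch just that record with an HTTP Range request, not the whole file.
              chunk = requests.get(warc_url, headers={"Range": f"bytes={start}-{end}"}).content
              print(record["url"], len(chunk), "bytes")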
        
         | fulmicoton wrote:
         | That's on my to-do list for next week. :)
        
       | capableweb wrote:
       | > which is key as each instance issues a lot of parallel requests
       | to Amazon S3 and tends to be bound by the network
       | 
        | I wonder if most of the cost comes from S3, EC2, or the
        | "premium" bandwidth that Amazon charges such a ridiculous amount
        | for. Since it seems to be doing a lot of requests, it wouldn't
        | surprise me if it's the network cost, and if so, I wonder why
        | they would even use AWS at all.
        
         | ddorian43 wrote:
          | > I wonder if most of the cost comes from S3
          | 
          | The current cost mostly comes from storing the large dataset
          | in S3.
          | 
          | > it wouldn't surprise me if it's the network cost
          | 
          | Network cost applies only to outbound traffic. Inside AWS it's
          | free (except multi-region, etc.). EC2 <-> S3 bandwidth is free
          | (you pay for requests).
        
       | Grimm1 wrote:
        | How are you dealing with the fact that Common Crawl updates its
        | data much less regularly than commercial search engines? And
        | that each update is only a partial refresh?
       | 
       | Edit: And I will say your site design is very nice.
        
         | francoismassot wrote:
         | Thank you! We did not plan to regularly update the index. But
         | as it takes only 24 hours to index 1B pages, the easiest way
         | would be to reindex everything, upload it to S3 and update the
         | metadata so the search engine will query the right segments.
        
         | guilload wrote:
          | We indexed Common Crawl only for the purpose of this demo, so
          | this is a one-time thing; we won't deal with updates.
        
           | Grimm1 wrote:
            | Ah, I understand: you're showcasing the methodology for the
            | underlying index, but you're going to open source the
            | engine. I see, great stuff then, super novel, and honestly
            | the rest of the open source search engines can definitely
            | use some competition. Love it!
        
       | sam_lowry_ wrote:
       | Why use AWS if you are cost-conscious?
        
         | francoismassot wrote:
         | The main reason is that AWS S3 is widely used. We obviously
         | want to make it work on HDFS, MinIO and other relevant storage
         | systems.
        
       | imhoguy wrote:
        | Could this be adapted for IPFS? Anyone with a stateless client
        | and a link to the index could search and become part of a swarm
        | to speed up trendy queries with redundancy.
        | 
        | Then update it with git-like diff versioning, and use IPNS to
        | point to the HEAD of the latest chain of the index.
        
       | heipei wrote:
       | This looks really interesting, I wonder how they will monetize it
       | though.
       | 
       | As an aside, projects like these are what keep me wondering
       | whether I should switch from cheaper but "dumb" object stores to
       | AWS since on AWS you can use your object store together with
       | things like Athena etc. and get pay-per-use search / grep and a
       | lot of other things, without the egress fees since it's all
       | within AWS.
        
         | fulmicoton wrote:
          | We really need to make this clear in our next blog post. This
          | is not grep. We are using the same data structures that are
          | used in Elasticsearch or Google.
          | 
          | We just adapted them to be object-storage friendly. I would
          | not call object storage dumb by any means. It is a very
          | powerful bottom-up abstraction.
          | 
          | We do manage to get SSD-like throughput from it. The latency
          | is the big issue. We had to redesign our search to reduce the
          | number of random reads in the critical path to the bare
          | minimum.
        
           | heipei wrote:
           | Appreciate the response. I wasn't trying to say this is grep,
           | I fully understand that this is an inverted index which is
           | way more interesting to build on top of S3.
           | 
           | I merely wanted to say that by using S3 within AWS you always
           | have the fallback option of brute-force "grep" across your
           | semi-structured "data lake" or whatever it's called thanks to
           | the aggregate bandwidth and Athena.
        
             | fulmicoton wrote:
             | Ah my bad! Yes, Humio (and Loki) are opting for this
             | approach.
             | 
             | This does decouple compute and storage in a trivial manner.
             | There is indeed a realm in which this brute force approach
             | is the best approach.
             | 
             | We could probably make a 4D chart with QPS, data size,
             | latency, and retention period and define regions where the
             | elastic/SOLR approach, Humio, and quickwit are the most
             | relevant.
        
       | busymom0 wrote:
        | Is this reliant on S3, or can it be used on something like
        | MinIO, DigitalOcean Spaces, or Backblaze B2 too? Backblaze-to-
        | Cloudflare data transfer is free, so that can reduce costs a
        | lot, plus B2 is much cheaper than S3.
        
         | fulmicoton wrote:
          | It can work on any object storage. I really want to test on
          | MinIO to see performance fly :)
        
       | phendrenad2 wrote:
       | Searching the web is a fool's errand. Google doesn't even search
       | the web anymore, they just mind-controlled everyone to submit
       | nightly sitemaps to them. Google is more of an index than a
       | search engine nowadays.
        
       | ywelsch wrote:
       | Interesting! We've built similar support for decoupling compute
       | from storage into Elasticsearch and, as coincidence would have
       | it, just shared some performance numbers today:
       | 
       | https://www.elastic.co/blog/querying-a-petabyte-of-cloud-sto...
       | 
       | It works just as any regular Elasticsearch index (with full
       | Kibana support etc.).
       | 
       | The data being indexed by Lucene allows queries to access index
       | structures and return results orders of magnitude faster than
       | doing a full table scan.
       | 
       | It is complemented with various caching layers to make repeat
       | queries fast.
       | 
       | We expect this new functionality to be used for less frequently
       | queried data (e.g. operational or security investigations, legal
       | discoveries, or historical performance comparisons on older
       | data), trading query speed for cost.
       | 
       | It supports Google Cloud Storage, Azure Blob Storage, Amazon S3
       | (+ S3 compatible stores), HDFS, and shared file systems.
        
       | johnghanks wrote:
       | This is an ad.
        
       | karterk wrote:
        | Cool demo. Searching for phrases like "there was a" and "and
        | there is" takes a really long time. I presume that since the
        | words are common, the lists of document IDs mapped to those
        | individual tokens are very long, so intersections etc. take
        | longer?
        
         | francoismassot wrote:
          | Thanks! You are totally right. For the demo, we have even
          | banned a few words like "the" because their inverted lists
          | contain almost all doc ids...
        
       | hu3 wrote:
       | Article title is "Searching the web for < $1000 / month".
       | 
       | Despite mentioning Rust once, of course it had to be added to the
       | title on HN as "Search 1B pages on AWS S3 for 1000$ / month, made
       | in Rust and tantivy".
        
       | snidane wrote:
       | Chaos Search seems to be doing this architecture already and
       | according to the podcast episode [1], it uses a highly optimized
       | storage layout.
       | 
       | Never used it, so would be interested if somebody could comment
       | on it.
       | 
       | [1] https://www.dataengineeringpodcast.com/chaos-search-with-
       | pet...
        
       | marcinzm wrote:
       | Interesting although a 15 second response time on certain queries
       | is not a very good user experience.
        
         | cj wrote:
         | On the other hand, under 1.5 seconds on common / basic search
         | terms is pretty good.
        
           | fulmicoton wrote:
            | The poster was referring to the latency of the demo and is
            | absolutely correct. The demo can reach 30s on some queries.
            | Half of it is due to fetching 180k documents and generating
            | snippets, and half of it is single-threaded Python code that
            | has nothing to do with our product :).
        
         | fulmicoton wrote:
          | This demo is indeed quite misleading.
          | 
          | The high response time is due to the fact that we generate 18k
          | snippets to build the tag cloud. Imagine this is the
          | equivalent of clicking through pages 1 to 900 on Google!
          | 
          | A "barack obama" phrase query generating 20 snippets runs in
          | less than 2 seconds on our 2 cheap servers.
          | 
          | I'll set up a "normal 20-result search" setting next week and
          | share an API to show the latency again.
        
       | rossmohax wrote:
        | It is a cool project. S3 can be cost efficient, but only if you
        | don't touch the data :)
        | 
        | Their price calculation doesn't mention the cost of S3 requests,
        | which adds up very quickly and is often neglected.
        | 
        | It costs $1 for 2.5M GET requests to S3. They have 180 shards,
        | and in the general case a query seems to fetch all of them.
        | Presumably they don't download a full shard per request, but
        | download an index + some relevant ranges. Let's say that is 10
        | requests per shard. So each query would issue ~1,800 S3 GET
        | requests, meaning ~1,400 search queries cost them $1.
        | 
        | Assuming their service is reasonably popular and serves 1
        | req/second on average, that would be $1,440 per 30 days in
        | addition to the advertised $1,000 spent on EC2 and S3 storage.
       | 
       | Seems comparable to AWS ElasticSearch service costs:
       | 
       | - 3 nodes m5.2xlarge.elasticsearch = $1,200
       | 
       | - 20TB EBS storage = $1,638
        
         | fulmicoton wrote:
          | I tend to agree :). If we get 1 req/s, even for a dataset of
          | that size, this is not as cost efficient.
          | 
          | For that kind of use case, I'd probably start using MinIO.
          | 
          | > Seems comparable to AWS ElasticSearch service costs:
          | > - 3 nodes m5.2xlarge.elasticsearch = $1,200
          | > - 20TB EBS storage = $1,638
          | 
          | Don't forget S3 includes replication. Also EBS throughput
          | (even with SSD) is not good at all. Also, our memory footprint
          | is tiny, which is necessary to make it run on two servers.
          | 
          | Finally, CPU-wise, our search engine is almost 2x faster than
          | Lucene.
          | 
          | If you don't believe us, try to replicate our demo on an
          | Elasticsearch cluster :D.
          | 
          | Chatnoir.eu is the only other Common Crawl cluster we know of.
          | It consists of 120 nodes.
        
           | rossmohax wrote:
            | > If we get 1 req/s, even for a dataset of that size, this
            | is not as cost efficient.
            | 
            | How many req/s do you have in mind for your system to be a
            | viable option?
            | 
            | > Also EBS throughput (even with SSD) is not good at all.
            | 
            | It's still not worse than S3, right?
            | 
            | > Chatnoir.eu is the only other Common Crawl cluster we know
            | of. It consists of 120 nodes.
            | 
            | I have no deep ES experience. Are you saying that to host
            | 6TB of indexed data (before replication) you'd need a
            | 120-node ES cluster? If so, then reducing it to just 2 nodes
            | is the real sales pitch, not the S3 usage :)
        
           | pcnix wrote:
            | Have you checked out the new EBS gp3 disks? Throughput vs
            | cost is much better on those than gp2, and they are also
            | cheaper than Provisioned IOPS.
        
           | klohto wrote:
            | What about d3en instances? Clustered, and together with
            | MinIO you might reach similar performance. The only issue is
            | the inter-node traffic; it would need to stay inside the
            | same AZ.
            | 
            | EDIT: Realizing that d3 only has slow HDDs
        
         | fizx wrote:
         | It's easy to put a block cache in front of the index, and I'm
         | sure they'll get to it sooner or later.
         | 
         | The benefit of using S3 in that case is that unlike e.g.
         | Elastic, your block cache servers don't need replication, and
         | you can tear them down when you're done. You can put them in a
         | true autoscaling group as well.
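         
          A minimal sketch of such a block cache (the bucket name, block
          size, and cache size are illustrative, and this is not Quickwit's
          implementation):
         
              from functools import lru_cache
              import boto3
         
              s3 = boto3.client("s3")
              BLOCK_SIZE = 1 << 20  # 1 MiB blocks (illustrative)
         
              @lru_cache(maxsize=4096)  # keep up to ~4 GiB of hot blocks in memory
              def get_block(bucket: str, key: str, block_id: int) -> bytes:
                  """Fetch one fixed-size block of an object and cache it locally."""
                  start = block_id * BLOCK_SIZE
                  resp = s3.get_object(Bucket=bucket, Key=key,
                                       Range=f"bytes={start}-{start + BLOCK_SIZE - 1}")
                  return resp["Body"].read()
         
              def read(bucket: str, key: str, offset: int, length: int) -> bytes:
                  """Serve an arbitrary byte range out of the block cache."""
                  first = offset // BLOCK_SIZE
                  last = (offset + length - 1) // BLOCK_SIZE
                  data = b"".join(get_block(bucket, key, b) for b in range(first, last + 1))
                  skip = offset - first * BLOCK_SIZE
                  return data[skip:skip + length]
         
          Because the cache is only a cache, losing a node just costs some
          re-fetches from S3, which is why such servers can live in a plain
          autoscaling group with no replication.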
        
         | bufferoverflow wrote:
         | AWS is almost never cost efficient. Maybe if you stay in their
         | free tier.
        
           | arcturus17 wrote:
            | > AWS is almost never cost efficient.
            | 
            | A ridiculous blanket statement, despite the "almost never"
            | cop-out...
            | 
            | It is cost-efficient in a wide array of scenarios. Many
            | companies pay for it because they have calculated the
            | different investment scenarios and AWS comes out on top of
            | alternatives such as owning the hardware or using competing
            | cloud vendors.
            | 
            | I own a consultancy that builds complex web apps, and while
            | I appreciate how occasionally a dev has tried to save costs
            | for me by cramming every piece of the stack (web server,
            | cache, db, queue, etc.) into a single Docker image to host
            | in a droplet, I'd much rather pay for separate services, as
            | I consider it cheaper in the long run.
        
             | bufferoverflow wrote:
             | Name an AWS tier that there's no cheaper alternative for.
             | I'm only aware of Glacier, I haven't seen anything cheaper
             | than it.
             | 
             | AWS is convenient and reliable, but it's not cheap.
        
             | tjoff wrote:
             | Many companies pay for it because they have spent an
             | inordinate amount of time learning the ecosystem and know
             | of nothing else.
        
           | [deleted]
        
         | heipei wrote:
         | For what it's worth, if you want to run ElasticSearch on AWS I
         | would always go with local-NVMe instances from the i3 family,
         | this is also what AWS and Elasticsearch themselves recommend.
         | 
         | 4x i3en.2xlarge (64GB / 5TB NVMe) at $449 / month (1yr
         | reserved) is $1796, or $2636 without reservation, but much
         | better performance due to the NVMe drives.
        
         | returningfory2 wrote:
          | For DigitalOcean object storage, data transfer to/from a
          | DigitalOcean VM is free. You only pay for bytes-at-rest.
          | 
          | But it seems S3 doesn't have a similar offering. Data transfer
          | is free between S3 and EC2 instances, but you still pay the
          | per-request charge.
          | 
          | I wonder if you can factor this into the pricing calculation.
        
           | rossmohax wrote:
            | An obvious optimization would be to cache chunks locally on
            | every worker node.
        
         | tpetry wrote:
          | I had the same feeling when reading the post. Their remark
          | that they "estimated the cost" to be that low is, in my
          | experience, a bad signal. Estimating costs in the cloud is
          | really hard; there are so many (hidden) costs you may miss
          | that make it a lot more expensive.
        
       | ykevinator3 wrote:
       | What an amazing project, good luck to you guys and thanks for
       | sharing.
        
         | fulmicoton wrote:
         | Thank you @ykevinator!
        
       | simonw wrote:
       | What does your on-S3 storage format look like? Are you storing
       | relatively large blobs and doing HTTP Range requests against them
       | or are you storing lots of tiny objects and fetching the whole
       | object any time you need it?
        
         | guilload wrote:
          | What we store on S3 is a regular tantivy index plus another
          | tiny data structure that we call the "turbo index", which
          | makes queries faster on object storage. For this demo, the
          | tantivy indexes are fairly large and we issue HTTP Range
          | requests against them.
         | 
         | https://github.com/tantivy-search/tantivy
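         
          A rough sketch of that access pattern with boto3; the bucket,
          key, offsets, and footer layout below are illustrative
          placeholders, not the actual tantivy file format:
         
              import boto3
         
              s3 = boto3.client("s3")
              BUCKET, KEY = "my-index-bucket", "common-crawl/split-042.idx"  # made-up names
         
              def read_range(offset: int, length: int) -> bytes:
                  """Fetch [offset, offset + length) of the object with an HTTP Range request."""
                  resp = s3.get_object(Bucket=BUCKET, Key=KEY,
                                       Range=f"bytes={offset}-{offset + length - 1}")
                  return resp["Body"].read()
         
              # Typical pattern: first grab a small footer that says where each section
              # of the index lives (a suffix Range request), then fetch only the slices
              # a given query actually needs.
              footer = s3.get_object(Bucket=BUCKET, Key=KEY, Range="bytes=-4096")["Body"].read()
         
              # Offsets below are placeholders; in a real index they come from the footer.
              term_dictionary = read_range(offset=0, length=1 << 20)
              postings_block = read_range(offset=52_428_800, length=65_536)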
        
       | not2b wrote:
       | But are you solving the right problem? This sounds like someone
       | has produced a very good and efficient version of AltaVista. Back
       | in the 1990s, if you wanted to do classic keyword searches of the
       | web, and find all pages that had terms A and B but not C, it
       | would give them to you, in a big unsorted pile. The web was still
       | small enough that this was sometimes useful, but until Google
       | came along with tricks to rank pages that are obvious in
       | retrospect, it just wasn't useful for common search terms.
        
       | cardosof wrote:
        | Congrats on the project and very cool demo!
        | 
        | One point that may help: I searched for the word "fast" with
        | "adjective" selected and it didn't show results.
        
         | francoismassot wrote:
          | Thanks! I guess you had bad luck and the server did not
          | respond; we have a bunch of errors on the Python server and it
          | may come from there.
          | 
          | It's working now, you can try it and find that the result is
          | "fast and easy": https://common-
          | crawl.quickwit.io/?query=fast&partOfSpeech=AD...
        
       | ClumsyPilot wrote:
        | Seems like you could build a workstation that runs these queries
        | faster and cheaper than AWS ever could, on a RAIDed set of NVMe
        | drives.
       | 
       | https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-th...
        
       | ProKevinY wrote:
       | Brilliant and interesting project by smart people. Kudos. (the
       | demo is addictive af)
        
       | ryanworl wrote:
       | What are you using for metadata storage?
        
         | fulmicoton wrote:
         | There are only 180 splits. For this demo we use a file.
         | 
         | For more serious stuff we use postgresql.
        
           | ryanworl wrote:
           | What does the metadata structure look like?
        
             | guilload wrote:
             | We store the URI of each shard making up the index and,
             | optionally, partition key and value(s). Along with a few
             | flags, we also store the shard size, creation and last
             | modification time. This additional metadata is not required
             | for the query planning phase and is only useful for
             | managing the life cycle of the shards and
             | debugging/troubleshooting.
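         
          A hypothetical sketch of what one such metadata record could look
          like as a row (field names are illustrative, not Quickwit's
          actual schema):
         
              from dataclasses import dataclass, field
              from typing import List, Optional
         
              @dataclass
              class ShardMetadata:
                  uri: str                              # e.g. "s3://bucket/index/shard-042.idx"
                  size_bytes: int                       # lifecycle/debugging only
                  created_at: int                       # unix timestamps
                  last_modified_at: int
                  partition_key: Optional[str] = None   # optional partitioning info
                  partition_values: List[str] = field(default_factory=list)
                  flags: int = 0
         
              # Query planning only needs the URIs (plus partition info to prune shards).
              shards = [
                  ShardMetadata("s3://bucket/index/shard-000.idx", 39_000_000_000, 1620000000, 1620000000),
                  ShardMetadata("s3://bucket/index/shard-001.idx", 41_000_000_000, 1620000000, 1620000000),
              ]
              uris_to_query = [s.uri for s in shards]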
        
       | artembugara wrote:
       | Francois, Adrien, that's a super nice demo.
       | 
       | Stateless search engine is something new, for sure.
       | 
       | I'd be super interested to see how it evolves over time. We're
       | [1] indexing over 1,000,000 news articles per day. We're using
       | ElasticSearch to index our data.
       | 
       | Would be interested to see if there's a way to make a cross-demo?
       | Let me know.
       | 
       | [1] https://newscatcherapi.com/
        
         | fulmicoton wrote:
         | That sounds interesting indeed.
         | 
         | Can you schedule a meeting with me? https://calendly.com/paul-
         | quickwit/30min
        
           | artembugara wrote:
           | Merci
        
       | [deleted]
        
       | natpat wrote:
        | This is super interesting. I've recently also been working on a
        | similar concept: we have a reasonable amount (in the terabytes)
        | of data that's fairly static, that I need to search fairly
        | infrequently (but sometimes in bulk). A solution we came up with
        | was a small, hot, in-memory index that points to the location of
        | the data in a file on S3. Random access of a file on S3 is
        | pretty fast, and running on an EC2 instance means latency to S3
        | is almost nil. Cheap, fast and effective.
        | 
        | We're using some custom Python code to build a Marisa Trie as
        | our index. I was wondering if there were alternatives to this
        | setup?
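         
          A minimal sketch of that setup, assuming the marisa-trie and
          boto3 packages and made-up bucket/key names:
         
              import boto3
              import marisa_trie
         
              s3 = boto3.client("s3")
              BUCKET, KEY = "my-data-bucket", "docs.jsonl"  # made-up names
         
              # Small, hot, in-memory index: document id -> (offset, length) in the S3 file.
              # RecordTrie packs the values using a struct format ("<QQ" = two uint64s).
              records = [
                  ("doc-0001", (0, 1_200)),
                  ("doc-0002", (1_200, 950)),
              ]
              trie = marisa_trie.RecordTrie("<QQ", records)
         
              def fetch(doc_id: str) -> bytes:
                  """Look up the byte range in the trie, then range-read it from S3."""
                  offset, length = trie[doc_id][0]
                  resp = s3.get_object(Bucket=BUCKET, Key=KEY,
                                       Range=f"bytes={offset}-{offset + length - 1}")
                  return resp["Body"].read()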
        
         | fulmicoton wrote:
          | There might be much better alternatives, but it really depends
          | on the nature of your keys.
          | 
          | Because the crux of S3 is latency, you can also decide to
          | encode the docs in blocks and retrieve more data than is
          | actually needed.
          | 
          | For this demo, the index from DocID to offset in S3 takes 1.2
          | bytes per doc. For a log corpus, we end up with 0.2 bytes per
          | doc.
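         
          A toy sketch of the block-encoding idea (in memory here; in
          practice each block would be fetched from object storage with a
          Range request, and the block size and layout are illustrative,
          not Quickwit's format):
         
              import zlib
         
              # Store docs in compressed blocks of 1000. The in-memory index then only
              # needs one (offset, length) entry per 1000 docs, which is how a per-doc
              # cost well below one byte becomes possible; a read fetches a whole block
              # even when only one doc is needed.
              BLOCK_DOCS = 1000
         
              def build_blocks(docs):
                  blob, block_index = bytearray(), []
                  for i in range(0, len(docs), BLOCK_DOCS):
                      block = zlib.compress(b"\n".join(docs[i:i + BLOCK_DOCS]))
                      block_index.append((len(blob), len(block)))
                      blob += block
                  return bytes(blob), block_index
         
              def get_doc(blob, block_index, doc_id):
                  offset, length = block_index[doc_id // BLOCK_DOCS]
                  block = zlib.decompress(blob[offset:offset + length])
                  return block.split(b"\n")[doc_id % BLOCK_DOCS]
         
              docs = [f"document {i}".encode() for i in range(5000)]
              blob, index = build_blocks(docs)
              assert get_doc(blob, index, 4242) == b"document 4242"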
        
         | looklikean wrote:
          | Combining data-at-rest with a slim index structure and a
          | common access method (like HTTP) was the idea behind a
          | key-value store for JSON I once wrote:
          | https://github.com/miku/microblob
          | 
          | I first thought of building a custom index structure, but
          | found that I did not need everything in memory all the time.
          | Using an embedded LevelDB works just fine.
        
         | heipei wrote:
         | You could look at AWS Athena, especially if you only query
         | infrequently and can wait a minute on the search results. There
         | are some data layout patterns in your S3 bucket that you can
         | use to optimize the search. Then you have true pay-per-use
         | querying and don't even have to run any EC2 nodes or code
         | yourself.
        
         | gbrits wrote:
         | Also check out Dremio with parquet files stored on S3
        
         | thejosh wrote:
         | You might want to check out Snowflake for something like this,
         | it makes searching pretty easy, especially as it seems your
         | data is semi-static? We use it pretty extensively at work and
         | it's great.
         | 
          | For your use case it'll be very cheap if you don't access it
          | constantly (you can probably get away with the extra-small
          | instances, which are billed per minute).
         | 
         | Not affiliated in anyway, just a suggestion.
        
         | giovannibonetti wrote:
         | This is the kind of thing I value in Rails. Active storage [1]
         | has been around for a few years and it solves all of this. All
         | the metadata you care about is in the database - content type,
         | file size, image dimensions, creation date, storage path.
         | 
         | [1] https://guides.rubyonrails.org/active_storage_overview.html
        
         | ddorian43 wrote:
          | > that I need to search fairly infrequently (but sometimes in
          | bulk).
          | 
          | What do you mean by search ? Full-text search ? Do you need to
          | run custom code on the original data ?
          | 
          | > A solution we came up with was a small, hot, in-memory index
          | that points to the location of the data in a file on S3.
          | 
          | Yes, it's like keeping the block index of an SSTable (in
          | RocksDB) in memory. The next step is to have a local cache on
          | the EC2 node. And the step after that is to have a
          | "distributed" cache on your EC2 nodes, so you don't query S3
          | for a chunk if it's present on any of your other nodes.
          | 
          | Come to think of it, I searched and didn't find a "distributed
          | disk cache with optional replication" that can be used in
          | front of S3 or whatever dataset. You can use nginx/varnish as
          | a reverse proxy but it doesn't have "distributed". There is
          | Alluxio, but it's single-master.
        
           | natpat wrote:
           | > What do you mean by search ?
           | 
           | Search maybe is too strong a word - "lookup" is probably more
           | correct. I have a couple of identifiers for each document,
           | from which I want to retrieve the full doc.
           | 
           | I'm not sure what you mean by running custom code on the
           | data. I usually do some kind of transformation afterwards.
           | 
           | I didn't find anything either, which is why I was wondering
           | if I was searching for the wrong thing.
        
             | ddorian43 wrote:
             | How big is each document ? If documents are big, keep each
             | of them as a separate file and store the ids in a database.
             | If documents are small, then you want something like
             | https://github.com/rockset/rocksdb-cloud for a building
             | block
        
           | hungnv wrote:
            | > Come to think of it, I searched and didn't find a
            | "distributed disk cache with optional replication" that can
            | be used in front of S3 or whatever dataset. You can use
            | nginx/varnish as a reverse proxy but it doesn't have
            | "distributed". There is Alluxio, but it's single-master.
            | 
            | If you think more about this, it would be like a distributed
            | key-value store with support for both disk and memory
            | access. You could write one using some open-source Raft
            | libraries; a possible candidate is TiKV from PingCAP.
        
             | ddorian43 wrote:
              | > If you think more about this, it would be like a
              | distributed key-value store with support for both disk and
              | memory access. You could write one using some open-source
              | Raft libraries; a possible candidate is TiKV from PingCAP.
              | 
              | My whole point was not building it ;)
              | 
              | There's also https://github.com/NVIDIA/aistore
        
       | jonatron wrote:
       | If you're going for low cost, you could do better:
       | 
       | https://www.hetzner.com/dedicated-rootserver/dell/dx181/conf...
       | 
        | Basic configuration in Finland (x1): 224,91 EUR
        | 
        | 1.92 TB SATA SSD Datacenter Edition (x4): 95,20 EUR
        | 
        | Total: 320,11 EUR
        | 
        | 320 euros equals about 385.90 US dollars
        
         | gallexme wrote:
          | Depending on the requirements,
          | https://www.hetzner.com/dedicated-rootserver/ax101/ may
          | actually be a better fit.
          | 
          | Once they're available again.
        
           | blobster wrote:
           | The value for money that Hetzner offer is just mind boggling.
        
           | ddorian43 wrote:
           | You'll be waiting 1+ month to get the server above.
        
           | jiofih wrote:
           | Holy smokes. 8TB SSD + 128GB RAM + Ryzen 9 for 100 euro a
           | month.
           | 
           | Can you get anywhere close to this with AWS or even DO?
        
             | fulmicoton wrote:
             | That's amazing pricing O_O (drooling)
        
             | heipei wrote:
              | The sad story is you can't get anywhere close to this even
              | with rented dedicated servers elsewhere. As a German I'm
              | happy that we have Hetzner and I use their services
              | extensively. However, if I wanted to start deploying
              | things in the US or Asia I'd be forced to go with
              | something like OVH which, while still a lot cheaper than
              | AWS, is still significantly more expensive than Hetzner.
        
             | wongarsu wrote:
              | On AWS you can't get 128GB RAM on anything for less than
              | $300/month (or nearly $500 on-demand). And to get multiple
              | TB of SSD you need significantly larger instances, north
              | of $1000/month.
              | 
              | Similar with DO: the closest equivalent is a 3.52TB SSD,
              | 128GB RAM, 16 vCPU droplet for $1240/month.
              | 
              | If you need raw power instead of integration into an
              | extensive service ecosystem, dedicated servers are hard to
              | beat (short of colocating your own hardware, which comes
              | with more headache). And Hetzner is among the best in
              | terms of value for money.
        
             | kuschku wrote:
             | Of course not. But that's why the "cloud" (as in the
             | typical DO/AWS/Azure/GCP offerings) are a scam.
        
               | heipei wrote:
                | Huge fan of Hetzner, but dedicated servers do not
                | invalidate the value proposition of the cloud.
                | 
                | Ordering a server at Hetzner can take anywhere between a
                | few minutes and a few days. Each server has a fixed
                | setup cost of around one month's rent. They only have
                | two datacenters in Europe. They don't have any auxiliary
                | services (databases, queues, scalable object storage,
                | etc.). They are unbeatable for certain use-cases, but
                | the cloud is still valuable for lots of other scenarios.
        
               | ryanlol wrote:
               | > They only have two datacenters in Europe
               | 
               | Nonsense, Hetzner operates like 25 datacenters.
        
               | heipei wrote:
               | Sorry, let's call it "regions" then, they have multiple
               | DCs in different cities in Germany, but for latency
               | purposes I would consider these part of one region.
        
               | marcinzm wrote:
               | Just because you don't understand the value proposition
               | of something doesn't make it a scam.
        
               | Retric wrote:
                | AWS is a scam not because it can't save you money, but
                | because they actively try to trick you into spending
                | more money. That's practically the definition of a scam.
                | 
                | Go to the AWS console and try to answer even simple
                | things like: how much did the last hour/day/week cost
                | me? Or how about some notification if that new service
                | you just added is going to cost vastly more than you
                | were expecting?
                | 
                | I know of a few people getting fired after migrating to
                | AWS, and it's not because the company was suddenly
                | saving money.
        
               | marcinzm wrote:
                | AWS is pretty bad at telling you how much something
                | you're not running will cost if you run it, but I've
                | never had any issues knowing what something has cost me
                | in the past.
                | 
                | > Go to the AWS console and try to answer even simple
                | things like: how much did the last hour/day/week cost
                | me?
                | 
                | Click user@account in the top right, click My Billing
                | Dashboard, spend this month is on that page in giant
                | font, click Cost Explorer for a more granular breakdown
                | (day, service, etc.), click Bill Details for a list
                | breakdown of spend by month.
                | 
                | > Or how about some notification if that new service you
                | just added is going to cost vastly more than you were
                | expecting?
                | 
                | Billing Dashboard and then Budgets.
                | 
                | edit: This assumes you have permissions to see billing
                | details; by default non-root accounts do not, which
                | might be why you're confused.
        
               | Retric wrote:
                | > Click user@account in the top right, click My Billing
                | Dashboard, spend this month is on that page in giant
                | font, click Cost Explorer for a more granular breakdown
                | (day, service, etc.), click Bill Details for a list
                | breakdown of spend by month.
                | 
                | Sure, you see a number, but I was just talking with
                | someone at AWS who said you still can't trust it to be
                | up to date, especially across zone boundaries. That
                | means it's useful when everything is working as expected
                | but can be actively misleading when troubleshooting.
        
               | whoknew1122 wrote:
                | Disclosure: I work at AWS.
                | 
                | I've never seen AWS actively try to trick people into
                | spending more money. I've seen Premium Support, product
                | service teams, solutions architects, and account
                | managers all suggest not using AWS services if they
                | don't fit the customer's use case. I've personally
                | recommended non-AWS options for customers who are trying
                | to fit a square peg into a round hole.
                | 
                | Can the billing console be better? Yes. But AWS isn't
                | trying to trick anyone into anything. The console, while
                | it has its troubles, doesn't have dark patterns, and
                | pricing is transparent. You pay for what you use, and
                | prices have never increased.
                | 
                | Hell, I know of a specific service that was priced
                | poorly (meaning it wasn't profitable for AWS). Instead
                | of raising prices, AWS ate the cost while rewriting the
                | entire service from scratch to give it better offerings
                | and make it cheaper (both for AWS and customers).
        
               | Retric wrote:
                | I haven't used AWS in a while, but one trick that I
                | recall was that enabling service X also enabled
                | sub-dependencies. Instantly disabling service X didn't
                | stop those dependent services, which you continued to be
                | billed for. Granted, not that expensive, but it still
                | felt like a trap.
                | 
                | Other stuff was more debatable, but it just felt like
                | dancing in a minefield.
        
               | wongarsu wrote:
               | > pricing is transparent
               | 
               | If pricing is intended to be transparent, then why is it
               | completely absent from the user interface? Transparent
               | pricing would be to tell me how much something costs when
               | I order it, not make me use a different tool or find it
               | in the documentation
        
               | marcinzm wrote:
                | No, no, you're supposed to use their Cthulhu-inspired
                | pricing tool. I mean, you've got at least a 50/50 chance
                | of figuring out how to use it before you go permanently
                | insane.
        
               | rossmohax wrote:
                | Another example of a somewhat dark pattern is listing
                | ridiculously small prices ($0.0000166667 per GB-second,
                | $0.0004 per 1000 GET requests). It's hard to reason
                | about very small and very big numbers; an order of
                | magnitude difference "feels" the same. Showing such
                | small prices is accurate, but deceiving IMHO.
        
               | rossmohax wrote:
                | I do not support the view that AWS is a scam, but price
                | is something AWS tries to make developers not think
                | about. Every blog post, piece of documentation, or quick
                | start tells you about features, but never about costs.
                | 
                | You read "you can run Lambda in a VPC", great, but there
                | is fine print somewhere on a remote page saying that
                | you'd also need a NAT gateway if you want said Lambda to
                | access the internet; a public network won't do.
                | 
                | You read "you can enable SSE on S3", but it is not
                | immediately obvious that every request then incurs a KMS
                | call and is billed accordingly (that was before the
                | bucket key feature).
                | 
                | Want to enable Control Tower? It creates so many
                | services that it is impossible to predict costs until
                | you enable it and wait to be billed.
        
               | michaelmrose wrote:
                | In order for a system to be effective at achieving a
                | goal, its owners and operators don't have to sit around
                | a table in a smoke-filled room and toast to evil. The
                | goal, good, bad or indifferent, merely has to be
                | progressively incentivized by prevailing conditions.
                | 
                | If clarity causes customers to spend less, it is
                | disincentivized, and since clarity is hard and requires
                | active investment to maintain, it decays naturally.
                | 
                | It's easy to see how you can end up with a system that
                | the users experience as a dishonest attempt to get more
                | of their money, and that operators, who are necessarily
                | very familiar with the system, experience as merely
                | messy but transparent.
                | 
                | Neither is precisely wrong; however, your users don't
                | have your experience or training, and many are liable to
                | interact with a computer, not you. Your system is then
                | exactly as honest and transparent as your UI as
                | perceived by your average user.
        
           | potiuper wrote:
           | Why a Ryzen instead of an Epyc in a data center?
        
             | hansel_der wrote:
              | Because it's a cheap hoster; they use a lot of desktop
              | CPUs.
              | 
              | AFAIK their most popular product is the EX4x line with an
              | i7-6700.
        
             | gallexme wrote:
              | Also because the 5950X is likely faster than a Zen 2 Epyc
              | for many workloads that do not scale linearly across more
              | cores (since Zen 3 has huge single-thread performance
              | improvements).
        
         | onebot wrote:
          | I am really starting to feel that co-location will make a big
          | comeback. It seems cloud costs are just becoming too high for
          | the convenience they once offered. For small projects and
          | scale it probably makes a ton of sense, but at some point the
          | costs to scale aren't worth the up-front developer cost
          | savings.
        
           | fulmicoton wrote:
            | It depends on the use case, doesn't it?
            | 
            | Shared-nothing is the best architecture for e-commerce
            | search, for instance.
            | 
            | But if you have one query every minute or so on a 1TB
            | dataset, it feels a bit silly to have a couple of servers
            | dedicated to it, doesn't it? Imagine this is the case for
            | all the big data search you can think of... logs, emails,
            | etc. This is a waste of CPU and RAM.
        
           | toast0 wrote:
            | Bare metal hosting is a happy medium between colo and cloud.
            | You don't have much control over the network, so it might
            | not be enough if you need faster NICs than they offer, but
            | if you fit in their offerings, it can work well.
            | 
            | On the other hand, the bare metal hoster I worked with is
            | now owned by IBM, and a big competitor is owned by private
            | equity; bare metal from cloud providers still has a lot of
            | cloudiness associated with it too. Maybe colo is the way to
            | go.
        
             | mwcampbell wrote:
             | How about OVH? They now have data centers in Canada and the
             | US as well as Europe.
        
           | curryst wrote:
           | Where they get you is that it very rarely makes financial
           | sense to do both cloud and colo/on-prem (unless you're a
           | massive company). It ends up being way more expensive to use
           | the cloud, but also hire engineers to work on making an on-
           | prem cloud. Most companies have a mixed bag of projects that
           | are either better served by the cloud, or are okay with colo
           | and the savings it can bring.
           | 
           | Assuming you don't want to do a hybrid approach, then you
           | either push everyone onto the cloud and accept paying more,
           | or you push everyone into colo and force the small and
           | scaling out projects to deal with stuff like having to order
           | hardware 3 months in advance.
           | 
           | Then, depending on how nice you want it to be to interact
           | with your infrastructure, you can end up paying a lot to have
           | people build abstractions over it. Do you want developers to
           | be able to create their own database from a merge request or
           | API call? If so, now you're going to have to hire someone
           | with a 6 figure salary to figure out how to do that. It's
           | easy to forget how many things are involved in that. You're
           | going to have a lot of databases, so you need a system to
           | track them. A lot of these databases are presumably not big
           | enough to warrant a full physical server, so you have to sort
           | out multi-tenancy. If you have multi-tenancy, you need a way
           | to handle RBAC so one user can't bork all the databases on
           | the host. You will also need some way to handle what happens
           | when one user is throwing so much load at the RDBMS it's
           | impacting other apps on that database. To accomplish that,
           | you're going to need a way to gather metrics that are sharded
           | per-database and a way to monitor those (which is admittedly
           | one of the easier bits). You also generally just straight up
           | lose a lot of the scaling features. I don't have a way to
           | just give you more IOPS to your database on-prem. The best I
           | can do is add more disks, but your database will be down for
           | a long time if I have to put a disk in, expand the RAID, let
           | it redistribute data and then power it back up. That's
           | several hours of downtime for you, along with anyone who's on
           | the same database. Of course, we can do replicas, and swap
           | the master, but everyone will have to reconfigure their apps
           | or we need something like Consul to handle that (which means
           | more engineers to manage that stuff).
           | 
            | You're also probably going to need more than one of those
            | expensive infra people, because they presumably need an
            | on-call rotation, and no one is going to agree to be on-call
            | all the time. And every time someone quits, you have to
            | train the new person, which is several months of salary
            | basically wasted.
           | 
            | That's not to say that you don't need infra people on AWS,
            | but you a) need a lot fewer of them, because they only need
            | to manage the systems AWS has, not build them, and b) can
            | hire cheaper ops people, again because you don't need people
            | who are capable of building those kinds of systems.
           | 
           | Once you factor in all of that stuff, AWS' prices start
           | looking more reasonable. They're still a little higher, but
           | they're not double the price. If anything more than a tiny,
           | tiny subset of the AWS features are appealing, it's going to
           | cost you almost as much to build your own as it does to just
           | pay Amazon/Google/Microsoft/whoever.
           | 
           | Also, a massive thing people overlook is that AWS is fairly
           | well documented. I can Google exactly how to set up
           | permissions on an S3 bucket, or how to use an S3 bucket as a
           | website. It only takes seconds, the cognitive burden is low,
           | and the low-friction doesn't cause anyone stress. In-house
           | systems tend to be poorly documented, and doing anything
           | slightly outside the norm becomes a "set up a meeting with
           | the infra team" kind of thing. It takes forever, but more
           | importantly, it takes a lot of thought and it's frustrating.
        
             | ElFitz wrote:
             | > Also, a massive thing people overlook is that AWS is
             | fairly well documented. I can Google exactly how to set up
             | permissions on an S3 bucket, or how to use an S3 bucket as
             | a website.
             | 
             | > In-house systems tend to be poorly documented, and doing
             | anything slightly outside the norm becomes a "set up a
             | meeting with the infra team" kind of thing.
             | 
              | I usually wasn't really happy with AWS' documentation. But
              | now, considering the alternative, I find it quite lovely.
              | Thank you for making me realize that.
        
             | rossmohax wrote:
              | You save on specialized engineers (database, RabbitMQ,
              | Ceph administrators), but you lose elsewhere.
              | 
              | What used to be an Apache server serving static files is
              | now an S3 bucket, but it won't be easy, because you wanted
              | your own domain, so now you need CloudFront for SSL
              | support. Their tutorial conveniently mentions it only at
              | step 7 ("Test your website endpoint").
              | 
              | You buy into Cognito, great, you saved money on a Keycloak
              | administrator, but at the worst moment, deep into the
              | project, you learn that there is absolutely no way to
              | support multiple regions, even if you are willing to do
              | some legwork for AWS. Or you find that the Cognito email
              | reset flow can't go through your existing customer contact
              | system and must go through SES only, and suddenly you find
              | yourself developing an elaborate log/event processing tool
              | just so that your customer service agents can see password
              | reset events in their interface.
              | 
              | GCP Cloud SQL, managed RDBMS, great! No upgrade path for
              | you other than a SQL dump/restore of your 10TB instance;
              | have fun.
              | 
              | Cloud might be a net win still, but it is very much not as
              | rosy as cloud evangelists want us to think.
        
         | [deleted]
        
         | fulmicoton wrote:
          | This is a complex misunderstanding...
          | 
          | First, we are getting better throughput from S3 than if we
          | were using a SATA SSD (and slower than an NVMe SSD). This is a
          | bit of a secret.
          | 
          | Of course, single sequential-read throughput on S3 sucks. At
          | the end of the day the data is stored on spinning disks and we
          | cannot do anything against the laws of physics.
          | 
          | ... but we can concurrently read many disks using S3. The
          | network is our only bottleneck. The theoretical upper bound on
          | our instances is 2GB/s. On throughput-intensive 1s queries, we
          | observe an average of 1GB/s.
          | 
          | Also, you are not accounting for replication. S3 costs include
          | battle-tested, multi-DC replication.
          | 
          | Last but not least, S3 trivially decouples compute and
          | storage. It means that we can host 100 different indices on S3
          | and use the same pool of search servers to deal with the
          | CPU-bound stuff.
          | 
          | This last bit is really what drives the price an extra 5x down
          | for many use cases.
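         
          A rough sketch of how concurrent Range requests can reach that
          kind of aggregate throughput even though each individual S3
          request is slow (bucket, key, chunk size, and worker count are
          illustrative):
         
              from concurrent.futures import ThreadPoolExecutor
              import boto3
         
              s3 = boto3.client("s3")
              BUCKET, KEY = "my-index-bucket", "common-crawl/split-042.idx"  # made-up names
              CHUNK = 8 * 1024 * 1024  # 8 MiB per request
         
              def read_chunk(offset: int) -> bytes:
                  resp = s3.get_object(Bucket=BUCKET, Key=KEY,
                                       Range=f"bytes={offset}-{offset + CHUNK - 1}")
                  return resp["Body"].read()
         
              def read_span(offset: int, length: int, workers: int = 64) -> bytes:
                  """Split one large read into many concurrent Range requests.
                  Each request has high latency on its own, but aggregate throughput
                  grows with concurrency until the instance's NIC becomes the bottleneck."""
                  offsets = range(offset, offset + length, CHUNK)
                  with ThreadPoolExecutor(max_workers=workers) as pool:
                      parts = list(pool.map(read_chunk, offsets))
                  return b"".join(parts)[:length]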
        
           | plater wrote:
           | "S3 costs include battle tested, multi-DC replication."
           | 
           | Sometimes we pay a bit too much for this multi-replication,
           | battle tested stuff. It's not like the probability of loosing
           | data is THAT huge. For the 4x extra cost you could easily
           | take a backup every 24h.
           | 
           | "It means that we can host 100 different indices on S3, and
           | use the same pool of search server to deal with the CPU-bound
           | stuff"
           | 
           | You can do that with NFS.
           | 
           | It's amazing how much we are willing to pay for a bunch of
           | computers in the cloud. Leasing a new car costs around
           | $350/month. You could have three new cars at your disposal
           | for the same price as this search implementation.
        
             | curryst wrote:
             | > For the 4x extra cost you could easily take a backup
             | every 24h.
             | 
             | It's also worth considering the cost to simply regenerate
             | the data for something like this that isn't the source of
             | truth. You'll lose any content that you indexed that has
             | disappeared from the web, but that seems like a feature
             | more than a bug.
             | 
             | > You can do that with NFS.
             | 
             | You're going to be bound by your NIC speed. You can bond
             | them together, but the upper bounds on NFS performance are
             | going to be significantly lower than on S3. Whether that's
             | going to be an issue for them or not, I don't know, but a
             | big part of the reason for separating compute and storage
             | is so that one of them can scale massively without the
             | other.
        
               | layla5alive wrote:
               | 100Gbps NICs are cheap, relative to the price of the
               | cloud...
        
           | 2Gkashmiri wrote:
            | How about setting up MinIO on these Hetzner setups? You get
            | the benefit of S3 on cheap hardware without the AWS costs.
        
             | fulmicoton wrote:
              | Absolutely! I want to try that... We are especially
              | interested in testing the latency MinIO could offer.
        
               | rossmohax wrote:
                | I've heard MinIO's metadata handling isn't great; it
                | queries all servers. SeaweedFS might give you better
                | results.
        
         | f430 wrote:
         | do you know if they let you host adult videos?
        
           | gallexme wrote:
            | Yeah, sure, if you have all the rights to host those videos
            | in the whole of Europe/Germany, it's allowed.
            | 
            | https://www.hetzner.com/rechtliches/cloud-server/?country=de
            | 
            | I know a couple of people who have naughty stuff on it (like
            | sex toy shops, sexual services, private for-sale adult
            | videos).
        
             | f430 wrote:
              | Hmmm, why Hetzner though? 100tb offers a massive amount of
              | bandwidth for tube sites, I think.
              | 
              | Do you know if Amazon S3 allows adult content?
        
       | visarga wrote:
        | Is it a web search engine or an adjective search engine? I'd
        | love to see someone make a deep search engine that goes beyond
        | the 100...1000 result limit.
        
         | fulmicoton wrote:
          | It is a web search engine. As explained in the blog post, we
          | made the demo by generating 18k snippets and pushing them
          | through an NLP pipeline that tries to extract the adjective
          | phrases.
          | 
          | The tech underneath is an inverted index.
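         
          The post doesn't say which NLP library is used; as a purely
          illustrative sketch, adjective extraction over snippets with
          spaCy could look like this:
         
              from collections import Counter
              import spacy
         
              # Assumes the small English model is installed:
              #   python -m spacy download en_core_web_sm
              nlp = spacy.load("en_core_web_sm")
         
              snippets = [
                  "the barbecue was surprisingly good and very cheap",
                  "a good, honest meal",
              ]
         
              # Count adjectives across all snippets to build a tag-cloud-style ranking.
              counts = Counter()
              for doc in nlp.pipe(snippets):
                  counts.update(tok.lemma_.lower() for tok in doc if tok.pos_ == "ADJ")
         
              print(counts.most_common(10))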
        
       | chrisacky wrote:
        | Is there a more recent Common Crawl dataset? 2019 is a long time
        | ago.
        | 
        | The reason I ask is that I'm trying to get all subdomains of a
        | certain domain. So I want a reverse-host listing of unique
        | hostnames under a certain domain.
        
         | guilload wrote:
          | There are more recent versions of the dataset. We used the
          | February/March snapshot from this year, and the April snapshot
          | just came out
          | (https://commoncrawl.org/2021/04/april-2021-crawl-archive-
          | now...).
        
       | djdjdjdjdj wrote:
        | Huh, I wonder why this is not a cost trap. The S3 API requests
        | are relatively expensive.
        
         | fulmicoton wrote:
         | The bandwidth is free if you are in the same region.
        
           | djdjdjdjdj wrote:
           | But you pay for requests.
        
       | bambax wrote:
       | Very interesting! For some reason I find search engines
       | fascinating...
       | 
       | How dependent is this on AWS? Can it be ported to another cloud
       | provider?
        
         | fulmicoton wrote:
         | We have a storage abstraction that boils down to being able to
         | perform Range queries.
         | 
         | Anything that allows us to do range queries is ok.
         | 
         | That includes basically all object storage I know of (Google
         | Cloud Storage, Azure Storage, Minio, you name it), but also
         | HDFS, or even a remote HTTP2 server.
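         
          Quickwit's actual abstraction lives in Rust; a Python sketch of
          the same idea, "a blob store you can range-read", might look like
          this (class and method names are made up):
         
              from abc import ABC, abstractmethod
         
              class Storage(ABC):
                  """The whole contract: fetch [offset, offset + length) of a named blob.
                  Anything that can serve byte ranges (S3, GCS, Azure, MinIO, HDFS, a
                  plain HTTP server, a local file) can implement it."""
         
                  @abstractmethod
                  def get_slice(self, path: str, offset: int, length: int) -> bytes: ...
         
              class LocalStorage(Storage):
                  def get_slice(self, path: str, offset: int, length: int) -> bytes:
                      with open(path, "rb") as f:
                          f.seek(offset)
                          return f.read(length)
         
              class S3Storage(Storage):
                  def __init__(self, bucket: str):
                      import boto3
                      self.bucket = bucket
                      self.s3 = boto3.client("s3")
         
                  def get_slice(self, path: str, offset: int, length: int) -> bytes:
                      resp = self.s3.get_object(Bucket=self.bucket, Key=path,
                                                Range=f"bytes={offset}-{offset + length - 1}")
                      return resp["Body"].read()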
        
       | 0xbadcafebee wrote:
       | > a new breed of full-text search engine
       | 
        | The following is a stupid question, so bear with me.
       | 
       | I have been using search engines for about... 26 years. I have
       | attempted to make really crappy databases and search engines. I
       | have worked for companies that use search products for internal
       | services and customer products. I'm not a _search engineer_ but I
       | have a decent understanding of them and their issues, I think.
       | And I get why people _want_ full-text search. But is it actually
       | a good idea? Should anyone really be using full text search?
       | 
       | I actually work on search products right now. We use Solr as the
       | general full text index. We have separate indexes and algorithms
       | to make context and semantic inferences, and prioritize results
       | based on those, falling back to full text if we don't get
       | anything. The full text sucks. The corpus of relationships of
       | related concepts is what makes the whole thing useful.
       | 
       | Are we (all) only using full-text because some users are
       | demanding that it be there? Or shouldn't we all stop this charade
       | of thinking that full-text search of billions of items of data
       | will ever be useful to a human being? Even when I show my
       | coworkers that I can get something done 10x faster with a curated
       | index of content, they _still_ want a search engine that they
        | know doesn't give them the results they want.
       | 
       | Is full-text search the junk food of information retrieval?
        
       ___________________________________________________________________
       (page generated 2021-05-07 23:00 UTC)