[HN Gopher] Show HN: Ranked Search for Semi-Structured Data
___________________________________________________________________
Show HN: Ranked Search for Semi-Structured Data
We've been working on a search problem that requires querying both
text and numbers simultaneously. For example, in a dataset of
clothing items with descriptions and prices, a search for "slim
pants for $20" should prioritize skinny jeans for $25 over slim
pants for $50 because they are semantically similar and the price
is closer. I've found that standard embedding models struggle with
numerical ordering, while text-to-SQL methods rely on exact matches
and often filter out too many results.

To solve this, we built a system designed specifically for
structured datasets like CSVs or tables. Here's a demo link where
you can upload a small CSV to try it out (no login required):
https://demo.tryvoker.com.

Unlike most RAG approaches, we process each column independently,
handling text with embeddings and numbers with custom scoring. When
a user submits a query, we parse it into relevant fields--for
instance, extracting "slim pants" as the description and "20" as
the price. We then compute cosine similarity between the
description embeddings and "slim pants" while also calculating the
percent error between the user's price input and the numerical
field. These individual similarity scores are then combined across
all columns to generate a final ranking.

Right now, our system works best with well-structured data, so some
preprocessing is often needed. We're working on improving this by
detecting and restructuring messy data automatically, such as
pivoting columns or extracting attributes from large text fields.
We're also adding feedback mechanisms, like a thumbs up/down
system, to refine future search results based on user input.

I'd love to hear about your experiences with similar search
challenges and would appreciate any feedback!
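
A minimal sketch of the per-column scoring described above, using
the "slim pants for $20" example. Several things here are
assumptions, not details from the post: the three-dimensional
vectors are hand-made stand-ins for real embeddings, percent error
is turned into a similarity as max(0, 1 - error), the column scores
are combined with an unweighted mean, and the query is taken as
already parsed into a description vector and a target price. The
names cosine_similarity, numeric_similarity and rank are
hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def numeric_similarity(target, value):
    """Map percent error to [0, 1]; 1.0 means an exact match.

    Assumption: the clamp at 0 and the "1 - error" mapping are
    illustrative choices, not the scoring the post's system uses.
    """
    if target == 0:
        return 1.0 if value == 0 else 0.0
    percent_error = abs(value - target) / abs(target)
    return max(0.0, 1.0 - percent_error)


def rank(rows, query_vec, target_price):
    """Score each row per column, then average the column scores."""
    scored = []
    for row in rows:
        text_score = cosine_similarity(query_vec, row["embedding"])
        price_score = numeric_similarity(target_price, row["price"])
        # Assumption: unweighted mean across the two columns.
        scored.append(((text_score + price_score) / 2, row["name"]))
    return sorted(scored, reverse=True)


# Toy data: the vectors are made up so that "skinny jeans" sits close
# to "slim pants" and "wool sweater" sits far from both.
ROWS = [
    {"name": "slim pants", "embedding": [0.9, 0.1, 0.0], "price": 50},
    {"name": "skinny jeans", "embedding": [0.8, 0.3, 0.0], "price": 25},
    {"name": "wool sweater", "embedding": [0.0, 0.2, 0.9], "price": 60},
]
QUERY_VEC = [0.9, 0.1, 0.0]  # stand-in embedding for "slim pants"

ranking = rank(ROWS, QUERY_VEC, target_price=20)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy numbers the $25 skinny jeans rank above the $50 slim
pants, because the $50 price is more than 100% away from $20 and
contributes nothing, while the text scores are nearly the same. A
real system would likely want per-column weights and a gentler
numeric falloff than a hard clamp.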
Author : alrudolph
Score : 13 points
Date : 2025-02-27 17:57 UTC (5 hours ago)
(HTM) web link (demo.tryvoker.com)
(TXT) w3m dump (demo.tryvoker.com)
| catlover76 wrote:
| Aren't these the guys with the 6-day work week? If so, I think
| I'll pass on the product lol
| alwa wrote:
| This apparently has been a point of commentary, and the (YC-
| backed) firm is apparently open about it in their job listings
| (to their credit):
|
| https://news.ycombinator.com/item?id=43181601
|
| I feel like we could have done without the snark, but that the
| question of our collective norms with respect to labor
| conditions is legitimate to discuss...
|
| For my part, I've willingly worked 7-day weeks for significant
| periods of time when it felt important to me to do so. I was
| fairly compensated (albeit way less than the $120K Voker is
| offering for the full-stack engineer role in question), the
| work felt meaningful to me, the conditions of the job were
| clear upfront, I didn't have competing demands on my time, and
| it couldn't have gotten done any other way.
|
| At the same time I know salaried colleagues who get dragged in
| 5 days a week for no reason, even when there's nothing to do. I
| imagine there's a balance to be struck in here somewhere...
| sumoboy wrote:
| Not the results I expected with the query "80+"; "80+ Gold"
| worked better.
| alrudolph wrote:
| Thanks for sharing, what would you have expected the 80+ to
| return?
| tomnipotent wrote:
| Not OP, but at a minimum I'd expect range searches to work.
|
| This particular search pattern could either be an open-ended
| range search "all products 80-infinity" or it could be a
| keyword search "80+" as in "PSU 80+ Gold".
|
| I'm also noticing that many results have mixed-bag prices,
| while I was expecting something closer to being ranked by the
| price delta. For example "psu" returns 5 different $155
| products, but "psu 300" returns a $2000 RTX 4090 along with 4
| $155 PSUs. I don't know if it's a coincidence that the RTX
| has a stock of "304" or if you're applying the ranking method
| to fields outside of price.
| alrudolph wrote:
| Yeah, right now I'm not supporting ranges, but I can see the
| value and will work on adding them.
|
| The demo is set up to match on the description, price and
| stock columns. When you send a message, I try to parse it
| into searches for each column. I'm realizing with these
| kinds of searches that I probably have to do more work on
| that parsing, because it's not super clear if that number
| should be part of the description search or matched
| against price, etc.
| tomnipotent wrote:
| Numbers are incredibly common in many products - pretty
| much most computer parts (3080, 4090, z270, 13900k).
| Should probably always include them in ranking, but not
| necessarily filtering. Consider "PlayStation 5 games"
| should prioritize PS5 games, but many PS4 games may also
| be eligible. What it should not prioritize are $5
| PlayStation 1 games.
|
| Many of these challenges are obviated through "advanced"
| UI with separate fields for things like min/max price,
| though that's not available in text-only mediums.
___________________________________________________________________
(page generated 2025-02-27 23:01 UTC)