[HN Gopher] On a great interview question
___________________________________________________________________
On a great interview question
Author : behdad
Score : 36 points
Date : 2023-04-12 19:54 UTC (3 hours ago)
(HTM) web link (behdadesfahbod.medium.com)
(TXT) w3m dump (behdadesfahbod.medium.com)
| texuf wrote:
| I guess I've always been confused by this question. If the
| dictionary isn't infinitely large, and you have plenty of space,
| why can't you put the dictionary in a Set and look up words in
| O(1), reducing the overall complexity to O(n)? Obviously hashing
| has its limitations, but i thought we could hand wave over that
| part. If we're limited by space can we assume that the dictionary
| is ordered and make the lookup in log(n) via a simple binary
| search? Or are we obsessing over the string equality operation? I
| feel like I'm missing something.
| behdad wrote:
| Yes, the string equality itself is O(n).
| elijaht wrote:
| Yea, I dislike it (a bit) for that reason. The "meme" in
| leetcode interviews is that "hashmap lookup is O(1)". It does
| feel like a bit of a gotcha to (correctly, mind you) expect an
| answer of O(n) for the lookup.
|
| That being said, I think you could get around this with
| careful prompting (i.e., first asking "what is the Big-O for
| dictionary lookup with respect to the number of letters"
| before asking for the overall Big-O). But I think there is a
| bit too much wiggle room to make this a completely consistent
| question; too much variance from candidate to candidate.
|
| I like the problem a lot, but wouldn't feel comfortable using
| it.
| texuf wrote:
| Because the author didn't bother to explain that this is
| what she's looking for in a well-thought-out blog post, I
| assume she also didn't explain what she was looking for in
| the interview, leaving a lot of candidates who didn't have
| the same educational background floundering.
| Herval_freire wrote:
| The interviewer is not thinking logically. How does he know it's
| a good interview problem? Let's look at the data:
|
| >The fastest I had a candidate solve the entirety of the problem
| with all bells and whistles was in 20 minutes, by a freshman from
| the University of Waterloo, the type who did high-school
| competitions.
|
| >The most depressing failure I saw was a PhD graduate from
| University of Toronto who could not produce working code for the
| first section of the problem in 45 minutes.
|
| >I once had a candidate with 15-years work experience give me
| lots of attitude for asking them such a simple question, while at
| the same time struggling at it.
|
| All of this data points to the fact that this question may
| not be good. A PhD graduate and a person with 15 years of
| experience rejected in favor of someone who practices
| programming for competitions? What gets me is that the data
| is painting an obvious picture here. A very obvious picture:
| that we aren't sure what makes a good or a bad interview
| question.
|
| But the problem is that most people looking at this
| completely miss it. It's not obvious to the interviewer and
| it's not obvious to a lot of people who like Google-style
| questions. We literally have not much data and not much
| science backing any of this up.
|
| It's an illustration of how biased humans are, and of how
| extra biased interviewing for software positions is. If
| there's anything more unknowingly biased than the replication
| crisis in science, it's technical interviews at companies.
| There need to be real feedback loops that correlate passing
| an interview question with actual performance.
|
| Google is in a good position to gather this data, but I'm not
| sure they are doing so, given how they just okayed this guy's
| gut decision to go against the grain and use this question.
| I'm not against this question, but to call it great in the
| face of the controversial data that he himself gathered and
| listed in his post is a complete blueprint of the extent of
| human blindness.
|
| The reality of what's going on here is that the interviewer
| is just getting off on dominating other people with a hard
| question. It's not on purpose; he's doing it without
| realizing it. The blog post itself is a bit showy. It's like
| "I can answer this question but a PhD graduate can't".
| abtinf wrote:
| As a hiring manager, I've noticed most peer hiring managers
| vastly overestimate their ability to hire.
|
| What I've seen happen is that there are two narratives they
| flip on: I hired this great person and they are doing amazing
| so I am amazing; I hired this person and they aren't doing
| great so I fired them so I am amazing.
|
| Of course, they also tend to be terrible at assessing who is
| actually good/bad.
|
| I call it the "arbitrary onion": layer after layer of plausible
| claims that each turn out to be just as arbitrary as the outer
| claim.
| tptacek wrote:
| This is a fun puzzle, but it's not clear to me how this selects
| for people with the aptitude to perform well in any particular
| software development job, as the number of times I've been asked
| to determine whether a string is/contains the concatenation
| of two dictionary words is, I think, zero.
|
| Instead of doing stuff like this, you can just give candidates
| realistic programming challenges and have them write code. You
| don't even need to be in the room with them; after all, you won't
| be when they're actually doing the work.
| abecedarius wrote:
| I actually have had to solve a version of this problem on the
| job. There was a database of tagged photos; for some reason all
| the tag strings had had their spaces stripped. So you had to
| efficiently find a probable chunking into words. (Of course the
| original text wasn't always all dictionary words.)
|
| OTOH, re the concern about O(n^2) time when n is bounded by
| word lengths, and when the inner loop is inside the
| already-optimized string comparisons of your language's
| hashtable library -- I think that part is very artificial.
| adamwk wrote:
| At least in my experience dynamic programming only comes up
| when I'm studying for interviews. Of all algorithms common in
| interview questions, I think DP has the highest ratio for
| occurrence on whiteboards vs occurrence in most software
| development jobs. I understand it's fun when you spot the
| solution but I think it only really proves your candidate did
| their homework
| dekhn wrote:
| I came to the field from a slightly different direction; I
| never studied DP in classes (I was a bio major) but used them
| extensively in my undergraduate research and later when
| implementing a few toy bioinformatics algorithms (it's a real
| joy when it works for the first time!). Even though I've
| written them a number of times, I really doubt I'd ever be
| able to spot a DP problem and solve it with that, without
| some hinting. I do not think it provides useful signal unless
| you're looking for people who will be writing algorithms.
| behdad wrote:
| You are correct in that. A few points though:
|
| - While DP might be rare, memoization is a quite common
| alternative: sticking a @functools.cache on the function in
| Python, for example. They are functionally equivalent in many
| cases.
|
| - I believe knowing their data-structures and algorithms is
| the difference between an okay programmer and a good one.
|
| - When interviewing entry-level software-engineers, there's
| not much else you can ask them except for what they learned
| in class, which is these kinds of questions.
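A minimal sketch of the memoization point above (hypothetical code, not from the article): a recursive word-break check with @functools.cache behaves like the bottom-up DP table, since each subproblem is solved at most once.

```python
import functools

def can_segment(s, words):
    """Return True if s can be split entirely into dictionary words."""
    word_set = set(words)

    @functools.cache  # memoization stands in for an explicit DP table
    def ok(i):
        # True if the suffix s[i:] can be split into dictionary words.
        if i == len(s):
            return True
        return any(s[i:j] in word_set and ok(j)
                   for j in range(i + 1, len(s) + 1))

    return ok(0)
```

Each index i is computed once, just as dp[i] is filled once in the iterative version.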
| lmarcos wrote:
| > I believe knowing their data-structures and algorithms is
| the difference between an okay programmer and a good one
|
| For me, the difference between an okay programmer and a
| good one is: how much Linux/Unix they know (processes, core
| system calls, networking, sockets, awk/sed/bash). Whether
| you are writing apis for ecommerce platforms or writing
| custom k8s providers, Linux knowledge is a must. Knowing
| how to implement a trie? I bet most of us have never been
| in that situation.
| bonzini wrote:
| That depends on what kind of system you are developing.
| For example if you're writing APIs for e-commerce
| platforms you may need to know how to tune a container's
| CPU and memory limits but not the difference between
| cgroups v1 and v2.
|
| Likewise, there are jobs where being able to reason on
| complexity and data structures is important.
| adamwk wrote:
| Those are fair points and you also said you didn't expect
| them to implement DP. I'm curious what your minimum
| expectations are for the candidate to solve. When I ask
| these questions it's usually about the journey more than
| the destination (which I believe you mentioned in the
| beginning) but I'm guessing you'd at least expect the
| candidate to come up with the brute force solution.
| lmarcos wrote:
| But even if the job actually requires solving puzzles like that
| one, what's the point of asking the candidate to solve it in 45
| minutes? Without googling? That's what's actually totally
| unrealistic.
|
| Give me any problem related to algorithms+ds to be solved in
| 45mins without internet access and someone looking over my
| shoulder and I will give you a poor solution. But give me a day
| and internet connection and I will solve the problem in the
| most efficient/readable/maintainable way possible.
| tptacek wrote:
| It's especially ironic that this puzzle structure selects
| away from candidates whose ordinary method would be to look
| up available algorithms, rather than just going with the best
| answer they came up with off the top of their head.
| bonzini wrote:
| To be honest, the first solution is literally one line of
| code. The data structures are the ones included in Python;
| there's nothing to look up. It's an important skill to take
| some code written by others and understand its performance
| characteristics precisely.
|
| The interview question is not just about knowing data
| structures and is not about Behdad showing off either. It's
| the starting point for a Socratic discussion.
| behdad wrote:
| That might have worked before ChatGPT. :)
| pcwalton wrote:
| Unfortunately, I just checked and ChatGPT gives the correct
| (slow) answer to your question:
|
|     def is_concatenation_of_two_dict_words(word, dict_words):
|         """
|         Returns True if the input word is a concatenation
|         of two words in the input list of dictionary words,
|         and False otherwise.
|
|         Args:
|         - word (str): The input string to check.
|         - dict_words (list[str]): A list of dictionary
|           words to check against.
|
|         Returns:
|         - (bool): True if the input word is a concatenation
|           of two words in the input list of dictionary
|           words, and False otherwise.
|         """
|         for i in range(1, len(word)):
|             prefix = word[:i]
|             suffix = word[i:]
|             if prefix in dict_words and suffix in dict_words:
|                 return True
|         return False
|
| In general, I don't think that LeetCode questions have any
| particular advantage over work sample tests when LLMs are
| involved. Your questions will end up on LeetCode, where LLMs
| will index them and will be able to recite the answers.
| espadrine wrote:
| Bing Chat also gave this answer to part 2:
|
|     def is_concatenation_of_dictionary_words(s, words):
|         if not s or not words:
|             return False
|         n = len(s)
|         dp = [False] * (n + 1)
|         dp[0] = True
|         for i in range(1, n + 1):
|             for j in range(i):
|                 if dp[j] and s[j:i] in words:
|                     dp[i] = True
|                     break
|         return dp[n]
|
| It doesn't find the trie (or regex) solutions to parts 1
| and 2, though. It also finds the right complexity for its
| solution to part 1, but when asked for an O(n) solution, it
| first repeats an equivalent algorithm and suddenly claims it
| is O(n) because the hash is O(1).
|
| That said, I believe an engineer with the answers it gave
| would easily figure out from its output what the right
| complexity is. (Figuring out the trie may not be so easy.)
|
| Meanwhile, ChatGPT is not yet at a point where it can write
| out a full working repository solving a nontrivial problem.
| It can help, but it is not autonomous; and realistically, if
| someone can get that help to reach a good solution for a
| test, they can do so for a salary.
| sho_hn wrote:
| > I don't think that LeetCode questions have any particular
| advantage over work sample tests when LLMs are involved.
|
| I agree. We do work sample tests, and in addition to the
| code and docs the candidates hand in, what really matters
| is the walkthroughs we do with them. Why did they do it
| that way? What alternatives did they consider? What are the
| pros and cons? What past projects did they draw on? How did
| they research this?
|
| Candidates usually enjoy this - most programmers enjoy
| talking about a just-finished project, especially when they
| feel good about the result - and you get to learn a lot
| about them.
|
| If someone turned in LLM-assisted work and lied about
| it, I doubt they'd fare well. And if they did use LLM
| assist - could be an interesting conversation all the same.
| What did you reject? What did you correct? Why?
| Paul-Craft wrote:
| [dead]
| [deleted]
| sfvisser wrote:
| I don't think it's super relevant for the challenge to be
| similar to the day job challenges. Maybe it's even unclear what
| those are at the point of interviewing.
|
| What is important is that the candidate's skills generalize
| well to both the challenge in the interview as well as the day
| job. A question like this might be a good indicator or not
| depending on the job.
|
| If people completely fail at this question it can reveal a lot
| of relevant information.
| sethammons wrote:
| "work sample test"
|
| I've given this same advice for _years_ for designing technical
| interviews. Take a problem that you or your team has actually
| faced in the last n months. Distill it down to remove tribal
| knowledge and get it to something you or a capable colleague
| can hammer out in m minutes. Give the candidate 3m minutes to
| solve it. Typically this means it should take you 15 minutes to
| solve and give them 45 minutes.
|
| Our biggest hit for a question that followed this for an SRE
| type role: we had a cron job that was legacy and buggy and
| caused a couple of issues over 6 months. We went back and
| created a very similar script with very similar bugs, added
| some similar data and re-seeding scripts, and told the
| candidates "customers say that feature $x is not working, and
| that is run by this script. Here is how to run it, here is the
| user's id that can be found in the data seed." We added some
| other data validation problems. After they fixed it, we asked
| them what they would do to make this more production safe and
| dig in on what they have and have not done for improving
| existing systems. We expected discussions on structured
| logging, metrics, alerting, testing, code layout, etc.
|
| Over time, the roles that you are looking to fill and the
| weaknesses on the existing team that you want to shore up will
| change and you will need different interview questions. Maybe
| more towards distributed systems or more towards testing or
| observability.
| OkayPhysicist wrote:
| I suspect it's less useful as a "this person will excel" test
| and more useful as a "this person won't fail miserably" test.
| Which for most hiring decisions is the more important factor. I
| would under no circumstances hire someone who couldn't at least
| get the brute-force level solutions and talk intelligently
| about performance concerns.
|
| IMO, the disconnect between developers' disdain for LeetCode
| and its continued use stems from the fact that we developers
| tend to assume that the goal of the hiring process is to find
| the very best possible candidate, whereas the goal in reality
| is to find somebody "good enough" as efficiently as possible.
| activitypea wrote:
| The disdain is fueled by companies' behavior. Your approach
| makes sense, but if that were the case, then a DSA whiteboard
| interview should be pass/fail -- a company shouldn't prefer
| a candidate that managed to cook up an O(n log n) solution
| over a candidate that only got an O(n^2) solution. In my
| experience, no company doing whiteboard interviews works like
| that. As OP themselves said in another comment, "it's not a
| binary yes/no, but gives a range to score the candidate"
|
| At the end of the day, you can't test for negatives and just
| pick a random person that doesn't fail anything. You need to
| have a defined set of values to look for in a candidate, and
| to design an interview that tests for those values. That's
| why I hate DSAs -- for most (all?) companies, they're treated
| as dogma rather than a pragmatic solution to a difficult
| problem.
| lisasays wrote:
| _I suspect it 's less useful as a "this person will excel"
| test and more useful as a "this person won't fail miserably"
| test._
|
| Unfortunately the OP seems to be using it in the former sense
| (when applied to the trie solution that he thinks will
| separate the "good" candidates from the plodding lunkheads).
| Paul-Craft wrote:
| [dead]
| kolbe wrote:
| "it's less useful as a "this person will excel" test and more
| useful as a "this person won't fail miserably" test."
|
| This is also an unproven statement with dubious validity.
|
| I come from the trading world where the leetcode equivalent
| is a brain teaser. After much deliberation, the only thing we
| could think of for why a brain teaser was useful was that it
| often proved someone wanted the job so badly that they
| memorized all the brain teasers. And that type of dedication
| and drive is what we really wanted.
|
| To me, I don't understand why someone who develops with
| advanced IDEs, using complex tools and libraries, should in
| the least bit be judged by putting them in front of what is
| essentially a text editor and asking them to solve a problem
| that's of no relevance to their job, using no tools that
| they're comfortable with.
| agolio wrote:
| > I then push them on this until they realize that any
| implementation of the dictionary lookup function of a string
| would be o(n) at best, and as such the total runtime of the
| solution is O(n2).
|
| Is that true?
|
| I am actually siding towards the candidates on this one
|
| presumably it would be O(n_words_in_dictionary) for the
| dictionary check, which as the number of words in the dictionary
| is constant, would make the overall algorithm O(n) ...
| behdad wrote:
| You still need to query every byte of the input string...
|
| The best you could do beyond that is O(min(n, k)), where k
| is the length of the longest word in the dictionary. But
| without loss of generality that would be O(n).
| light_hue_1 wrote:
| I don't think so.
|
| You take the first n bytes of your input string. You hash
| them. You look them up in the dictionary. The lookup is O(1)
| but hashing your input is O(n).
|
| But there's a simple way around this.
|
| We have incremental hash functions. When you add a byte to
| the string and you use an incremental hash, you do a fixed
| amount of extra work to update the hash. That way you only
| need O(1) extra work to compute the new hash.
|
| One trivial incremental hash is to break your input into
| fixed size chunks, hash each chunk, and xor the chunks
| together.
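As an illustration only (a polynomial rolling hash, one common incremental scheme, not necessarily the chunk-and-xor variant sketched above; the constants are hypothetical), each prefix hash can be derived from the previous one with O(1) work per byte:

```python
# Hypothetical constants; any base and large prime modulus work similarly.
BASE, MOD = 257, (1 << 61) - 1

def prefix_hashes(s):
    """Return hashes of every prefix of s; hashes[i] covers s[:i]."""
    hashes = [0]
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD  # O(1) update per added byte
        hashes.append(h)
    return hashes
```

Hashing every prefix of an n-byte string this way costs O(n) total, rather than O(n^2) for hashing each prefix from scratch.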
| behdad wrote:
| That's the same as the trie solution.
| abtinf wrote:
| You only need to do that if the length of words is
| unbounded, which would be nonsensical.
|
| The maximum number of iterations is length of the largest
| word in the dictionary.
| agolio wrote:
| Good point, though I guess the crux of the disagreement for
| me is that the question's "dictionary words" implies an
| English dictionary, rather than an abstract dictionary of
| words of potentially infinite length.
| sverona wrote:
| > I then push them on this until they realize that any
| implementation of the dictionary lookup function of a string
| would be o(n) at best, and as such the total runtime of the
| solution is O(n2).
|
| Is n the length of the word or the length of the dictionary? And
| if it's the length of the word (<< the length of the dictionary),
| why does it matter if it's linear or quadratic?
| behdad wrote:
| n is the length of the string. As for why it matters whether
| it's linear or quadratic: we are assessing the candidate's
| ability to analyze a problem and possibly improve an
| algorithm. It might not matter in the problem at hand.
| sverona wrote:
| Okay, yeah, that's fair.
| edflsafoiewq wrote:
| Surely there is a longest word in the dictionary, so you can
| short circuit for any string longer than twice that. So an
| asymptotic analysis in the length of the string doesn't
| really make sense.
| Ensorceled wrote:
| A question I used for a number of years was a simple
| "compression" algorithm that would turn a string of lower case
| letters into a shorter string by length encoding using the
| capital equivalent:
|
| abbbbbcdddddde => aB5cD6e
|
| The number of developers who struggled or utterly failed with
| this was depressing.
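A sketch of one plausible reading of that encoding (assuming runs of length 1 stay lowercase and longer runs become the uppercase letter followed by the run length; function name hypothetical):

```python
from itertools import groupby

def compress(s):
    """Run-length encode lowercase letters: 'abbbbb' -> 'aB5'."""
    out = []
    for ch, group in groupby(s):  # consecutive equal characters
        n = len(list(group))
        out.append(ch if n == 1 else f"{ch.upper()}{n}")
    return "".join(out)

compress("abbbbbcdddddde")  # -> "aB5cD6e"
```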
| returningfory2 wrote:
| This is very interesting, thank you for sharing! I do have one
| tangential comment based on this:
|
| > I continued to use [the interview question] because I got good
| signal from the candidates.
|
| Working at Google, this is something I hear a lot: that some
| question, or other interview thing, is good for getting "signal"
| from candidates. But how do you actually know?
|
| Like, I feel to definitely know a question is good you would
| actually need to track candidates after they start working and
| see if their performance on the question correlates with their
| job performance. (Unfortunately, it seems impossible to track
| candidates who don't get an offer.)
| behdad wrote:
| Thanks for the comment. By saying "I get good signal" I mean
| that it's not a binary yes/no, but gives a range to score the
| candidate. You are right about actual correlation to job
| performance. I know HR at Google performed correlation
| analysis between interview scores and job performance and
| made internal observations, which, as you note, are biased
| because they only include hired candidates.
| gwbas1c wrote:
| I find that "good signal" is understanding a candidate's
| thought process and ability to intuit general programming
| concepts.
|
| These are hard to do if you think of an interview question like
| a high school quiz or a college final.
| abtinf wrote:
| There are a finite number of words in the dictionary, which
| implies there is an upper bound on the length of words.
|
| The maximum number of iterations is the length of the longest
| word in the dictionary.
|
| This means that part 1 of the problem can be solved in O(1)+O(1).
|
| That doesn't mean you will be happy with it. Not all O(1)s are
| equal.
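The bound being described can be sketched like this (hypothetical names; assuming the dictionary fits in a set): capping the split point at the longest word's length makes the number of iterations independent of the input length.

```python
def is_concat_of_two(word, dict_words):
    """Return True if word is a concatenation of two dictionary words."""
    words = set(dict_words)
    k = max(map(len, words), default=0)
    # Both halves must be dictionary words, so a prefix longer than
    # the longest word k can never match: at most k iterations.
    for i in range(1, min(len(word), k + 1)):
        if word[:i] in words and word[i:] in words:
            return True
    return False
```

Each membership test still hashes up to k bytes, which is the "not all O(1)s are equal" caveat above.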
| hhmc wrote:
| Similarly you can precompute a dictionary for the Cartesian
| product of the dictionary with itself, then lookup into that.
|
| Big in space but O(1) (wrt word len at least)
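A minimal sketch of that precomputation (hypothetical names): pair every dictionary word with every other up front, then answer each query with a single set membership test.

```python
def build_pair_set(words):
    """Precompute all two-word concatenations; O(k^2) space for k words."""
    return {a + b for a in words for b in words}

pairs = build_pair_set(["duck", "soup"])
"ducksoup" in pairs   # True
"duckling" in pairs   # False
```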
| abtinf wrote:
| Excellent point.
|
| The only way to evaluate if this is a good or bad solution is
| the actual operational context.
|
| If you had a service that had to do millions of these matches
| per second with low latency, then this might be a reasonable
| solution.
| languagehacker wrote:
| I'm a big fan of trie-based interview questions because if an
| individual can call them out immediately, they know their data
| structures -- awesome.
|
| If they only know simple data structures, going from a hash
| to a hash of hashes is one of those "eureka" moments that
| really shows how a candidate is willing to collaborate on a
| problem and react to new ideas.
|
| Heaps are an interesting data structure to use for interviewing,
| but I do find they can be a little less intuitive, and their use
| cases translate less directly to business logic and are more of
| an optimization task. It depends on what talents you're looking
| for on your team!
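The hash-of-hashes eureka moment can be sketched as a minimal trie (hypothetical code; a sentinel key marks word ends):

```python
END = "$"  # sentinel key marking the end of a word

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})  # a hash of hashes
        node[END] = True
    return root

def contains(trie, word):
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node
```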
| jhp123 wrote:
| compiling the dictionary into a DFA would give you O(n) runtime.
| lightbendover wrote:
| Both the author's and your run-times pretty plainly illustrate
| everything wrong with BigO.
| jhp123 wrote:
| ?
| behdad wrote:
| How?
| jhp123 wrote:
| use a regex like (aardvark|apple|...|zebra)* and the standard
| DFA construction for regular expressions.
| behdad wrote:
| You are technically right. Though if we assume that the
| size of the dictionary is at least o(n), then the size of
| such DFA will be exponential in n, and indexing it will be
| o(n), resulting in a o(n^2) solution again, I think.
| [deleted]
| ghusbands wrote:
| They are completely correct. If the DFA fits in RAM,
| following state transitions will be O(1) and using the
| DFA for a concatenated string of length N will be O(N).
| You simply missed a good solution.
| jhp123 wrote:
| The number of states is exponential for a general regex.
| This regex has an NFA closely resembling the trie for the
| dictionary. There could be at most one live state per
| depth in the trie. So the number of possible states is
| bounded at least by (words in dictionary)^(length of
| longest word). Really much smaller because the states at
| different levels have to share prefixes.
| bonzini wrote:
| Without the *, you get a graph that is a compressed version
| of the trie. With the *, the size of the DFA is extremely
| likely to explode.
| behdad wrote:
| Between 2010 and 2019 I interviewed dozens of Software Engineer
| candidates at Google. Almost always I asked the same interview
| question. Moreover, this question happened to be on the banned
| list at Google, because it was publicly available on Glassdoor
| and other interview websites, but I continued to use it because I
| got good signal from the candidates.
| foobarian wrote:
| I stuck to the same question for nearly a decade too, even if
| it was barely more difficult than fizzbuzz. The ability to
| compare against the past candidates was very useful, which
| would not be possible when changing questions too often.
|
| The sad reality was that it was waaaay too effective as a
| fizzbuzz eliminator (> 80%) which I don't know what it tells me
| about the candidates in general or just about our recruiting
| :-)
| yamtaddle wrote:
| > The sad reality was that it was waaaay too effective as a
| fizzbuzz eliminator (> 80%) which I don't know what it tells
| me about the candidates in general or just about our
| recruiting :-)
|
| I suspect there are a lot more people who frequently choke on
| easy coding questions, under the very specific conditions of
| interview-stress (which is very different from, "the client
| is unhappy and is also onsite and you'll be in a meeting with
| them later" stress, and from "production broke" stress, so
| no, I don't think it's a good proxy for general "grace under
| fire", as it were, either) than there are actual full-on
| bullshitters who manage to land software jobs while being
| _truly_ incapable of doing even the basics of the job.
|
| For one thing, I think such top-tier, extremely-convincing
| bullshitters would have an easier and less-stressful time
| applying that skill directly to some other role, since there
| are some well-paid roles where that kind of thing is exactly
| what companies want, so it doesn't pass the smell test for me
| that so _very_ many would be trying to land software jobs--
| not just ones who exaggerate skill or try to get into the
| industry while woefully unprepared, but who _also_ are able
| to fool interviewers at a reasonably-high rate in
| conversational interviews, yet fail simple coding tests.
|
| That is, I suspect a large majority of "ha! I caught a
| faker!" anecdotes or anec-data are actually a false-negative
| on the test from people choking under interview conditions,
| and that most _real_ fakers would also have been caught in
| any half-competent conversational interview anyway.
|
| The remainder should be probably be hired regardless, then
| re-homed to sales when you realize what's going on, since
| they're evidently top-percentile bullshitters and social
| chameleons. Joking, of course... kinda. Maybe.
| medvezhenok wrote:
| There's certainly part of that, but there's also likely
| selection/sample bias. The better the programmer, the less
| time they probably spend in interviews. And the "reward" on
| passing a programming interview if you're marginal is quite
| high, depending on your alternative choices of employment.
|
| (Top tier Ivy League university acceptance rate is 3-10%,
| so if you're looking for someone of at least that caliber
| among the general population (or even among developers), an
| 80% rejection rate might not be crazy. FAANG (or its ilk)
| are the "Ivy League" of Software Development jobs)
| activitypea wrote:
| "A great interview question" for what? The author sings their
| praises, but never actually explains what this question is good
| at testing or what makes it worth an interviewer and candidate's
| time. Results _might_ favor younger or earlier-in-career folks --
| okay, why is that a good thing?
| musicale wrote:
| Younger/earlier in career folks are usually cheaper, which
| companies like.
|
| Though I imagine they could just ask "when was your last
| algorithms course?" (and possibly where, and what grade you
| got) and save a lot of time.
|
| But that wouldn't be as useful for "making the candidate jump
| through more hoops" and reinforcing compliance and status
| hierarchy, which is a major purpose of interviews.
| jacknews wrote:
| What kind of programming will be done on the job by these
| candidates? Probably nothing like this puzzle.
|
| This kind of 'let's watch people struggle with my favourite LC
| puzzle, which I know inside-out' just seems very inconsistent,
| not very predictive, and just quite flaky to me.
|
| On the job, my very first response to any kind of problem that
| looks algorithmic is to google the hell out of it, even if I
| think I know the answer already. In future, it will be to get gpt
| to write it.
|
| I think the actual challenges in real-world software are not
| algorithmic, but architectural and beyond, and then only after
| thoroughly understanding the problem in the first place.
| bonzini wrote:
| I am guilty as charged of saying impulsively that the first
| solution is O(n)!
|
| The reverse trie is genius, but with dynamic programming do
| you still need it? I would say not, if in the general setting
| you can afford O(n^2) preparatory steps: you can just
| remember which (start, end) pairs correspond to a valid word
| with regular trie lookups. That is O(n) steps for each start
| position, giving a total of O(n^2).
| eesmith wrote:
| Doesn't that confuse two uses of n?
|
| One is the length of the word. The other is the number of
| dictionary elements.
|
| If your language is constrained to 1,000 words, then the
| dictionary will be pretty small and the hash table can be
| constructed with no chains. That should have constant-time
| lookup no matter how long the query concatenation is.
|
| I don't see why O(n^2) should be optimal. The question is only
| to determine if a match exists, and that can be expressed as a
| regular expression. In the following I omit all 1-letter words
| since my dictionary includes all the letters.
|
|     >>> words = [line.strip()
|     ...          for line in open("/usr/share/dict/words")
|     ...          if len(line.strip()) > 1]
|     >>> # because Python uses left-to-right match, not longest
|     >>> words.sort(key=len, reverse=True)
|     >>> pattern = "(" + "|".join(words) + ")+$"
|     >>> import re
|     >>> matcher = re.compile(pattern)  # takes a few seconds
|     >>> matcher.match("ducksoup")
|     <re.Match object; span=(0, 8), match='ducksoup'>
|     >>> matcher.match("ducksquom")
|     >>> matcher.match("theremanywordsinthisone")
|     <re.Match object; span=(0, 23), match='theremanywordsinthisone'>
|     >>> matcher.match("theremanywordsqinthisone") is None
|     True
|
| Python uses backtracking, so this probably isn't O(n),
| especially with the ability to choose the dictionary.
|
| But there are non-backtracking matchers which would make
| this O(n). Here's re2 from https://github.com/google/re2 :
|
|     >>> import re2
|     >>> opts = re2.Options()
|     >>> opts.max_mem = 200000000
|     >>> matcher = re2.compile(pattern, opts)
|     >>> match = matcher.match("ducksoup")
|     re2/re2.cc:821: DFA out of memory: pattern length 2493008,
|     program size 1323834, list count 1089251, bytemap range 54
|     >>> match.start(), match.end()
|     (0, 8)
|     >>> match = matcher.match("ducksquom")
|     re2/re2.cc:821: DFA out of memory: pattern length 2493008,
|     program size 1323834, list count 1089251, bytemap range 54
|     >>> match is None
|     True
|
| This is the first time I've used re2. There's probably a way
| to prevent that warning message -- I believe this is falling
| back to an NFA, which might not be O(n)?
|
| Anyway, here it is processing the concatenation of 100,000
| words selected at random (I've omitted the warning message):
| >>> s="".join(random.sample(words, 100000)) >>> len(s)
| 956249 >>> import time >>> t1 = time.time();
| print(f"is concatenation? {matcher.match(s) is not None} " +
| f"time: {time.time()-t1:.1f} seconds") is concatenation?
| True time: 12.3 seconds >>> t1 = time.time(); print(f"is
| concatenation? {matcher.match(s+'Q') is not None} " + f"time:
| {time.time()-t1:.1f} seconds") is concatenation? False
| time: 12.3 seconds
|
| Cut the size by 10 and it's about 10x faster:
| >>> s="".join(random.sample(words, 10000)) >>> len(s)
| 95457 >>> t1 = time.time(); print(f"is concatenation?
| {matcher.match(s) is not None} " + f"time: {time.time()-t1:.1f}
| seconds") is concatenation? True time: 1.3 seconds
|
| For this test it appears to be linear in the query size.
|
| Edit: if I increase max_mem to 2_000_000_000 then I don't get
| the warning message. But then the DFA-based search takes 141.5
| seconds. That should be O(n) time in the query length, but an
| order of magnitude slower than the NFA-based search.
|
| That's why O(n) analysis isn't enough to really figure out "a
| faster solution."
| bonzini wrote:
| In the article n always refers to the length of the string.
| behdad wrote:
| With the dynamic-programming you don't need the reverse trie
| indeed.
___________________________________________________________________
(page generated 2023-04-12 23:01 UTC)