[HN Gopher] Differentiable Finite State Machines
___________________________________________________________________
Differentiable Finite State Machines
Author : hardmaru
Score : 112 points
Date : 2022-06-08 05:44 UTC (17 hours ago)
(HTM) web link (google-research.github.io)
(TXT) w3m dump (google-research.github.io)
| mcovalt wrote:
| The intersection of graphs, FSMs, and linear algebra is really
| wild. I touched a very tiny bit on this topic a few weeks ago
| [0]. I highly recommend the book "Mathematics of Big Data" [1] if
| you're interested in this subject.
|
| [0] https://news.ycombinator.com/item?id=31344146
|
| [1] https://mitpress.mit.edu/books/mathematics-big-data
| time_to_smile wrote:
| For anyone coming in a bit confused by this, the magic here is
| in the concept of "differentiable". I fairly frequently tout
| the benefits of thinking in terms of differentiable
| programming, and the most common response is "what does that
| even mean?"
|
| The core idea is that a "differentiable" program is one whose
| derivative you can take, the same way you can take the
| derivative of a simple function in calculus.
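|
| A minimal sketch of that idea in JAX (the function here is
| purely illustrative):
|
|   import jax
|
|   def f(x):
|       # an ordinary numeric program...
|       return 3.0 * x ** 2 + x
|
|   # ...becomes a program that computes its own derivative
|   df = jax.grad(f)
|   print(df(2.0))  # 13.0, since f'(x) = 6x + 1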
|
| Why does that matter? Well, being able to take the derivative
| of a problem in turn means you can optimize that problem with
| gradient descent. Chris Olah recently had a great discussion on
| Twitter [0] about what makes gradient descent remarkable; in
| short:
|
| > Simple gradient descent creates mind-boggling structure and
| behavior, just as evolution creates the awe inspiring complexity
| of nature.
|
| Differentiable programming is, in a strange way, similar to logic
| programming in the sense that you can start solving problems by
| defining what the solution looks like. In this example you define
| the input and desired output and then learn the FSM that maps one
| to the other. Recall from Theory of Computation classes that an
| FSM is the simplest, least powerful model of computation (FSMs
| recognize exactly the regular languages).
|
| What's amazing here is that you're essentially seeing gradient
| descent writing a program. IMHO, that is insanely cool.
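|
| Roughly, the trick looks like this (a toy sketch in JAX, not
| the article's actual code; the softmax parameterization and
| the acceptance readout are assumptions on my part):
|
|   import jax
|   import jax.numpy as jnp
|
|   N = 4  # number of states, fixed up front
|
|   def run_fsm(params, inputs):
|       # params['T'][c]: transition logits for input symbol c;
|       # softmax turns each row into a probability distribution
|       T = jax.nn.softmax(params['T'], axis=-1)  # (2, N, N)
|       state = jnp.zeros(N).at[0].set(1.0)       # start state 0
|       for c in inputs:
|           state = state @ T[c]                  # soft transition
|       # readout: probability that the machine "accepts"
|       return state @ jax.nn.sigmoid(params['accept'])
|
|   def loss(params, inputs, target):
|       return (run_fsm(params, inputs) - target) ** 2
|
|   params = {
|       'T': jax.random.normal(jax.random.PRNGKey(0), (2, N, N)),
|       'accept': jnp.zeros(N),
|   }
|   # one gradient-descent step toward accepting the string 010
|   grads = jax.grad(loss)(params, (0, 1, 0), 1.0)
|   params = jax.tree_util.tree_map(
|       lambda p, g: p - 0.1 * g, params, grads)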
|
| What makes differentiable programming so interesting compared
| to something like logic programming is that the space of
| problems that can be solved this way is potentially much
| larger, _and_, compared to the backtracking search used by
| logic programming languages, gradient descent allows for
| potentially faster (though approximate) solutions at scale.
|
| Currently the vast majority of differentiable programming
| examples are essentially just building neural nets, but
| examples like this one show that we can think much more
| broadly, treating differentiable programming as a proper
| programming paradigm.
|
| 0.
| https://twitter.com/ch402/status/1533164918886703104?cxt=HHw...
| melony wrote:
| Regular expressions were invented by Kleene (of the Kleene
| star) to model neurons, predating gradient descent.
|
| Cf. _"Representation of Events in Nerve Nets and Finite
| Automata"_
|
| Before backpropagation was a glimmer in LeCun's eye.
| nshm wrote:
| Seems like in this differentiable FSM implementation you only
| differentiate edge weights; you cannot introduce or change
| nodes. So it is quite far from differentiation of programs.
| uoaei wrote:
| But if you add a huge number of nodes and put a sparsity-
| encouraging regularizer on the edge weights, you essentially
| have a model which can adapt to problems using subsets of its
| structure, somewhat like the "lottery ticket" hypothesis in NN
| theory. Then you can think of each traversal choice as a
| conditional branch and voila, the program runs!
|
| You can actually see this in effect in the article, where the
| author puts a penalty on the entropy of the transition
| probabilities.
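|
| For reference, that penalty looks something like this (a
| sketch; beta is an assumed hyperparameter, not necessarily the
| article's exact formulation):
|
|   import jax.numpy as jnp
|
|   def entropy_penalty(T, beta=0.01):
|       # T: (num_symbols, N, N) row-stochastic transition probs;
|       # low row entropy => near-deterministic (sparse) edges
|       eps = 1e-8
|       row_entropy = -jnp.sum(T * jnp.log(T + eps), axis=-1)
|       return beta * jnp.sum(row_entropy)
|
|   # total_loss = task_loss + entropy_penalty(T)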
| xbar wrote:
| You had me at finite.
| novosel wrote:
| I did a triple-take on this also. For some reason I thought of
| differentiation in the state space, which is, of course,
| oxymoronic. Fuzzy-logic-like to the n-th degree???
| albertzeyer wrote:
| This uses dense (soft/weighted) transitions from any state to any
| state, and then some regularization to guide it to more sparse
| solutions.
|
| In practice, the number of states can be huge (thousands, maybe
| millions), so representing this as a dense matrix is not going
| to work (a 1M x 1M matrix is far too big). The matrix must be
| sparse, and in practice (for all the FSTs you usually deal
| with) it is, so representing it densely is very wasteful.
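|
| A back-of-the-envelope sketch of why (plain numpy/scipy,
| illustrative numbers):
|
|   import numpy as np
|   from scipy.sparse import csr_matrix
|
|   N = 1_000_000  # a dense N x N float32 matrix would be ~4 TB
|   # FST-style sparse transitions: a few arcs per state,
|   # stored as (src, dst, weight) triples
|   src = np.array([0, 0, 1, 2])
|   dst = np.array([1, 2, 2, 0])
|   w = np.array([0.5, 0.5, 1.0, 1.0], dtype=np.float32)
|   T = csr_matrix((w, (src, dst)), shape=(N, N))
|
|   state = np.zeros(N, dtype=np.float32)
|   state[0] = 1.0
|   # one transition step costs O(number of arcs), not O(N^2)
|   state = T.T @ state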
|
| That's why there are many specialized libraries to deal with
| FSTs. Also in combination with deep learning tools. See e.g. K2
| (https://github.com/k2-fsa/k2).
| puttycat wrote:
| Indeed, this approach needs a predefined maximum number of
| states.
|
| This work, also referenced in the OP, uses recurrent neural
| nets and a genetic algorithm to evolve the minimal number of
| states needed:
|
| https://arxiv.org/abs/2111.00600
| jpeg_hero wrote:
| seems complicated.
| brandonb wrote:
| Although this blog post focuses on learning small toy examples, I
| could see the approach scaling up to real problems in natural
| language processing.
|
| At one point, Google's own speech recognition system was built on
| a C++ finite state transducer library [1][2]. FSTs were used to
| represent the language model, reverse dictionary (map from
| phonemes to words), hidden Markov models, and so on. Although
| FSTs were a common abstraction at runtime, at training time we
| still needed special-purpose ML algorithms to create models
| that were later "compiled" into FSTs. [3]
|
| So I could see how a general-purpose differentiable FSM, if
| scaled up, could actually produce a really powerful
| simplification -- you would just need one end-to-end learning
| algorithm to subsume all these various models.
|
| [1] https://www.openfst.org/twiki/bin/view/FST/WebHome
|
| [2] Much of this has obviously changed since the popularization
| of deep learning -- this is the 2011-era system.
|
| [3] A finite state transducer is a finite state machine with
| input and output labels, so that it translates between two
| regular languages.
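|
| A toy transducer, for the curious (hypothetical example in
| plain Python):
|
|   # arcs labeled (input, output): here 'a'->'x' and 'b'->'y'
|   ARCS = {
|       ('q0', 'a'): ('q0', 'x'),
|       ('q0', 'b'): ('q0', 'y'),
|   }
|
|   def transduce(s, state='q0'):
|       out = []
|       for ch in s:
|           state, emitted = ARCS[(state, ch)]
|           out.append(emitted)
|       return ''.join(out)
|
|   print(transduce('abba'))  # prints 'xyyx'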
___________________________________________________________________
(page generated 2022-06-08 23:00 UTC)