[HN Gopher] Hellandizing (1998)
       ___________________________________________________________________
        
       Hellandizing (1998)
        
       Author : redeemed
       Score  : 33 points
       Date   : 2024-03-28 05:01 UTC (1 days ago)
        
 (HTM) web link (www.multicians.org)
 (TXT) w3m dump (www.multicians.org)
        
       | tiahura wrote:
       | TIL NonStop lives.
       | 
       | https://en.wikipedia.org/wiki/NonStop_(server_computers)
       | 
       | https://www.hpe.com/us/en/compute/nonstop-servers.html
        
       | onetimeuse92304 wrote:
       | I wrote something similar in the past.
       | 
       | It was a credit card terminal application. It was supposed to
       | behave correctly no matter when during a transaction something
       | happened. That something could be an application crash, a power
       | cycle, etc. Users were frequently impatient with the slower
       | communication methods (we had land line and GPRS modems back
       | then) and would frequently power cycle the terminal if it took a
       | moment too long to do something.
       | 
       | I designed the application so that it always moves
       | transactionally between a series of well-defined states with no
       | observable states in between. Then I put test points between the
       | states to crash the application at random. Many of these points
       | were put in places in the code that would be extremely difficult
       | to test any other way (it was extremely unlikely a power cycle
       | would happen naturally at that exact point in time, between those
       | two exact instructions).
       | 
       | The application had the capability to crash at literally any
       | point and simply continue operation once power cycled.
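
The approach described in the comment above (persisted states, atomic
transitions, random crash test points) can be sketched roughly as follows.
This is a minimal illustration, not the commenter's code: the state names,
the JSON file, the crash probability and every function name are
assumptions made for the example, and a temp-file-plus-rename stands in for
whatever transactional storage the terminal actually used.

```python
import json
import os
import random


class SimulatedCrash(Exception):
    """Stands in for a power cycle or application crash at a test point."""


# Hypothetical transaction states; the real application's states are unknown.
STATES = ["IDLE", "CARD_READ", "AUTHORIZED", "RECEIPT_PRINTED", "DONE"]


def load_state(path):
    """Return the last committed state, or the initial state if none exists."""
    try:
        with open(path) as f:
            return json.load(f)["state"]
    except FileNotFoundError:
        return STATES[0]


def save_state(path, state, crash=lambda: None):
    """Commit a new state atomically: write a temp file, then rename it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    crash()                # test point: temp file written, not yet committed
    os.replace(tmp, path)  # atomic switch; readers see old or new, never half


def run_once(path, rng, crash_probability):
    """Resume from the persisted state and advance until DONE or a crash."""

    def crash_point():
        if rng.random() < crash_probability:
            raise SimulatedCrash()

    state = load_state(path)
    while state != STATES[-1]:
        crash_point()  # test point: before the transition starts
        next_state = STATES[STATES.index(state) + 1]
        save_state(path, next_state, crash=crash_point)
        crash_point()  # test point: right after the transition committed
        state = next_state
    return state


def run_with_power_cycles(path, seed=0, crash_probability=0.4):
    """Keep 'power cycling' the application until the transaction completes."""
    rng = random.Random(seed)
    restarts = 0
    while True:
        try:
            return run_once(path, rng, crash_probability), restarts
        except SimulatedCrash:
            restarts += 1  # on restart, all in-memory state is thrown away


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        final, restarts = run_with_power_cycles(os.path.join(d, "state.json"))
        print(final, "after", restarts, "simulated power cycles")
```

Because progress only ever lives in the committed file, a crash at any of
the three test points leaves either the old state or the new one on disk,
and the restart simply picks up from there.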
        
         | andai wrote:
         | Fascinating. Was it able to do this by constantly writing to
         | persistent storage?
        
           | onetimeuse92304 wrote:
           | Yes. I wrote an append-only database for the device. I started
           | with the database because I needed transactional storage, but
           | the device used flash chips without wear levelling. This and
           | the requirement for constant memory usage basically
           | disqualified any existing database. The device was very
           | limited in memory, only about 1MB available to the
           | application of which 600kB was used up by OpenSSL.
           | 
           | The database I implemented did not allocate any memory and
           | used a constant amount of stack which was important for me to
           | ensure the application can be verified statically.
           | 
           | So the database worked by having two files allocated. The
           | application would append data to one file, then when it was
           | full it would copy live entries to the other and start
           | appending there and so on.
           | 
           | This might sound wasteful but in reality write amplification
           | was very low. There was very little data that needed to be
           | copied, most records were created and then promptly deleted
           | when they were reconciled with the server.
           | 
           | All sorts of data was written to the storage. I started by
           | writing just basic transaction information but then I
           | discovered that I could also log other information (UI state,
           | etc.) to recover state in case of power failure. This was
           | extremely efficient: most UI operations would result in only
           | a single byte written to the flash. This was important as the
           | flash had quite limited durability and was typically a
           | limiting factor for the longevity of the device.
           | 
           | **
           | 
           | Now that I remember, there were other fun tricks I did.
           | 
           | One of them was for OpenSSL. This super underpowered
           | device took no less than 9 seconds to open an SSL connection
           | over GPRS.
           | 
           | That was fine when the device was first designed and we
           | deployed lots of them. Initially, we did not need SSL and we
           | did not need to open any connections to complete the
           | transaction. We only did that later, once a day, to send all
           | of the information and reconcile with the server.
           | 
           | But at some point we had to deploy the ability to do an online
           | check with the bank and were also required to have all
           | connections secured with SSL. 9 seconds for the client to wait
           | on the transaction was definitely not acceptable.
           | 
           | While we could try to keep the connection open, in many cases
           | the device would be deployed in places with poor connectivity
           | and that was just an unreasonable expectation.
           | 
           | I saved the company A TON of cash by figuring out I could gut
           | the OpenSSL library to be able to manually save and restore
           | the cryptographic state of the connection (symmetric
           | cryptography). I did the same for the system that terminated
           | connections on the server.
           | 
           | The application would connect to the server, skip the entire
           | handshake (not even a single handshake packet) and would
           | immediately, speculatively switch to the stored symmetric key.
           | 
           | The server kept a database of the most recent cryptographic
           | state for each of the known terminals and would try to
           | match the communication with its own stored cryptographic
           | state. If this worked, it would continue as if nothing
           | happened. If it did not, it would close the connection. The
           | terminal would then restart the connection with the complete
           | handshake from scratch. But it very rarely had to. As it was
           | very successful, we had to add functionality to force
           | clearing the state every night so that we knew there was at
           | least one fresh connection every day.
           | 
           | This cut down almost all of the overhead of OpenSSL.
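
The two-file append-only store described earlier in the comment above
(append to one file, copy the live entries to the other when it fills up,
then carry on appending there) might look something like this in outline.
It is only a sketch under stated assumptions: the class name TwoFileLog,
the text record format, the file names and the byte capacity are all
invented here. Unlike the original, it allocates memory freely, and it does
not try to survive a crash in the middle of the switch between files.

```python
import os


class TwoFileLog:
    """Append-only key/value log that alternates between two files."""

    def __init__(self, directory, capacity):
        # Hypothetical file names; the sketch always starts from empty files.
        self.paths = [os.path.join(directory, n) for n in ("log_a", "log_b")]
        self.capacity = capacity  # maximum size of one file, in bytes
        self.active = 0           # index of the file currently appended to
        self.copied = 0           # bytes rewritten by compaction so far
        for path in self.paths:
            open(path, "wb").close()

    @staticmethod
    def _record(op, key, value=""):
        # One record per line: "P key value" (put) or "D key " (delete).
        # Keys must not contain spaces; neither field may contain newlines.
        return f"{op} {key} {value}\n".encode()

    def _replay(self, path):
        """Rebuild the set of live entries by reading the log front to back."""
        live = {}
        with open(path, "rb") as f:
            for line in f:
                op, key, value = line.decode().rstrip("\n").split(" ", 2)
                if op == "P":
                    live[key] = value
                else:
                    live.pop(key, None)
        return live

    def _compact(self):
        """Copy live entries into the other file and make it the active one."""
        old, new = self.active, 1 - self.active
        live = self._replay(self.paths[old])
        with open(self.paths[new], "wb") as f:
            for key, value in live.items():
                record = self._record("P", key, value)
                f.write(record)
                self.copied += len(record)
            f.flush()
            os.fsync(f.fileno())
        open(self.paths[old], "wb").close()  # erase the old file
        self.active = new

    def _append(self, record):
        if os.path.getsize(self.paths[self.active]) + len(record) > self.capacity:
            self._compact()
            if os.path.getsize(self.paths[self.active]) + len(record) > self.capacity:
                raise OSError("store is full even after compaction")
        with open(self.paths[self.active], "ab") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())

    def put(self, key, value):
        self._append(self._record("P", key, value))

    def delete(self, key):
        self._append(self._record("D", key))

    def get(self, key):
        return self._replay(self.paths[self.active]).get(key)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as d:
        log = TwoFileLog(d, capacity=64)
        for i in range(20):
            log.put(f"txn{i}", "pending")
            log.delete(f"txn{i}")  # reconciled, so promptly deleted
        print("bytes copied by compaction:", log.copied)
```

With records that are created and then promptly deleted, as in the usage
loop, each compaction finds at most one live entry to copy, which is the
effect the comment describes as very little data needing to be copied.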
        
       | twic wrote:
       | Did something like and unlike this a while ago. Payment
       | processing system, multiple microservices written in Java on a
       | cloud platform. Tool runs on a developer machine, uses the cloud
       | tools to SSH into a running container and run jdb. It can add a
       | breakpoint, wait for it to get hit, then do a selected thing -
       | resume immediately, delay a while then resume, throw an
       | exception, hang forever, etc.
       | 
       | The main tool can also manage the platform, so start the app,
       | kill individual containers, etc. And inject payment messages, and
       | wait for payments to be sent out.
       | 
       | So you could write test plans like "add a breakpoint in
       | ValidateAccount::isValidSortCode, inject a payment from customer
       | A, wait for the breakpoint to get hit, inject a payment from
       | customer B, throw a NullPointerException, then check that the
       | payment from customer B gets delivered".
       | 
       | We didn't do systematic testing of failure at every point, as in
       | the article, but it was very nice to be able to automate testing
       | of error cases without having to have special code in the app for
       | it.
        
       | jdblair wrote:
       | A question about how NonStop worked:
       | 
       | If both processors are running the same process in the same
       | state, why won't both processors hit the same error condition at
       | the same time?
       | 
       | I understand there are random hardware faults that can happen,
       | bits can flip, etc., but logic errors should be bug-for-bug the
       | same on both processors.
       | 
       | So, were those random faults so frequent that the redundancy was
       | worth it? Or am I missing something?
        
       ___________________________________________________________________
       (page generated 2024-03-29 23:00 UTC)