[HN Gopher] Hellandizing (1998)
___________________________________________________________________
Hellandizing (1998)
Author : redeemed
Score : 33 points
Date : 2024-03-28 05:01 UTC (1 days ago)
(HTM) web link (www.multicians.org)
(TXT) w3m dump (www.multicians.org)
| tiahura wrote:
| TIL NonStop lives.
|
| https://en.wikipedia.org/wiki/NonStop_(server_computers)
|
| https://www.hpe.com/us/en/compute/nonstop-servers.html
| onetimeuse92304 wrote:
| I wrote something similar in the past.
|
| It was a credit card terminal application. It was supposed to
| behave correctly no matter when during a transaction something
| happened. "Something" could be an application crash, a power
| cycle, etc. Users were frequently impatient with the slower
| communication methods (we had land line and GPRS modems back
| then) and would frequently power cycle the terminal if it took a
| moment too long to do something.
|
| I designed the application so that it always moves
| transactionally between a series of well-defined states with no
| observable states in between. Then I put test points between the
| states to crash the application at random. Many of these points
| were in places in the code that would be extremely difficult to
| test any other way (it was extremely unlikely a power cycle would
| happen naturally at that exact point in time, between those two
| exact instructions).
|
| The application had the capability to crash at literally any
| point and simply continue operation once power cycled.
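|
| A minimal sketch of the idea above, not the original terminal
| code: each state change is committed atomically, and named test
| points can simulate a power cycle on either side of the commit.
| The names Terminal, SimulatedCrash and crash_everywhere, the list
| of states, and the JSON file are all assumptions made for
| illustration.

```python
import json
import os


class SimulatedCrash(Exception):
    """Stands in for a power cycle at a named test point."""


class Terminal:
    # Hypothetical well-defined states of one transaction.
    STATES = ["IDLE", "CARD_READ", "AUTHORIZED", "PRINTED", "DONE"]

    def __init__(self, path, crash_at=None):
        self.path = path
        self.crash_at = crash_at  # name of the test point to crash at, or None

    def _test_point(self, name):
        if name == self.crash_at:
            raise SimulatedCrash(name)

    def load(self):
        """Recover the last committed state after a (simulated) restart."""
        if not os.path.exists(self.path):
            return "IDLE"
        with open(self.path) as f:
            return json.load(f)["state"]

    def _commit(self, state):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"state": state}, f)
            f.flush()
            os.fsync(f.fileno())
        self._test_point(f"before-commit:{state}")
        os.replace(tmp, self.path)  # atomic: old state or new state, never half
        self._test_point(f"after-commit:{state}")

    def run(self):
        """Resume from the persisted state and advance until DONE."""
        state = self.load()
        while state != "DONE":
            state = self.STATES[self.STATES.index(state) + 1]
            self._commit(state)
        return state


def crash_everywhere(workdir):
    """Crash once at every test point, restart, and record the final state."""
    results = {}
    for state in Terminal.STATES[1:]:
        for when in ("before-commit", "after-commit"):
            point = f"{when}:{state}"
            path = os.path.join(workdir, f"{when}_{state}.json")
            try:
                Terminal(path, crash_at=point).run()
            except SimulatedCrash:
                pass  # "power cycle": all in-memory state is lost here
            results[point] = Terminal(path).run()  # restart without faults
    return results
```

| Running crash_everywhere on an empty directory exercises every
| test point once. A real device would also need to make side
| effects such as printing or network calls idempotent, which this
| sketch leaves out.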
| andai wrote:
| Fascinating. Was it able to do this by constantly writing to
| persistent storage?
| onetimeuse92304 wrote:
| Yes. I wrote an append-only database for the device. I started
| with the database because I needed transactional storage
| but the device used flash chips without wear levelling. This
| and the requirement for constant memory usage basically
| disqualified any existing database. The device was very
| limited in memory, with only about 1MB available to the
| application, of which 600kB was used up by OpenSSL.
|
| The database I implemented did not allocate any memory and
| used a constant amount of stack which was important for me to
| ensure the application can be verified statically.
|
| So the database worked by having two files allocated. The
| application would append data to one file, then when it was
| full it would copy live entries to the other and start
| appending there and so on.
|
| This might sound wasteful but in reality write amplification
| was very low. There was very little data that needed to be
| copied: most records were created and then promptly deleted
| when they were reconciled with the server.
|
| All sorts of data was written to the storage. I started by
| writing just basic transaction information but then I
| discovered that I could also log other information (UI state,
| etc.) to recover state in case of power failure. This was
| extremely efficient: most UI operations would result in only
| a single byte written to the flash. This was important as the
| flash had quite limited durability and was typically a
| limiting factor for the longevity of the device.
|
| **
|
| Now that I remember, there were other fun tricks I did.
|
| One of them was for OpenSSL. This super underpowered
| device took no less than 9 seconds to open an SSL connection
| over GPRS.
|
| That was fine when the device was first designed and we
| deployed lots of them. Initially, we did not need SSL and we
| did not need to open any connections to complete the
| transaction. We only did that later, once a day, to send all
| of the information and reconcile with the server.
|
| But at some point we had to deploy the ability to do an online
| check with the bank and were also required to have all
| connections secured with SSL. 9 seconds for the client to wait
| on the transaction was definitely not acceptable.
|
| While we could try to keep the connection open, in many cases
| the device would be deployed in places with poor connectivity
| and that was just an unreasonable expectation.
|
| I saved the company A TON of cash by figuring out I could gut
| the OpenSSL library to be able to manually save and restore
| the cryptographic state of the connection (symmetric
| cryptography). I did the same for the system that terminated
| connections on the server.
|
| The application would connect to the server, skip the entire
| handshake (not even a single handshake packet) and would
| immediately, speculatively switch to the stored symmetric key.
|
| The server kept a database of the most recent cryptographic
| state for each of the known terminals and would try to
| match the communication with its own stored cryptographic
| state. If this worked, it would continue as if nothing
| happened. If it did not, it would close the connection. The
| terminal would then restart the connection with the complete
| handshake from scratch. But it very rarely had to. As it was
| very successful, we had to add functionality to force-clear
| the state every night so that we knew there was at least one
| fresh connection every day.
|
| This cut down almost all of the overhead of OpenSSL.
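|
| The control flow above can be sketched roughly as follows. This
| is not TLS and not the modified OpenSSL: an HMAC tag stands in
| for traffic protected by the saved symmetric state, and
| full_handshake simply hands out a fresh key as a placeholder for
| a real key exchange. Server, Terminal and every method name are
| assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


class Server:
    def __init__(self):
        self.states = {}  # terminal id -> most recent symmetric key

    def full_handshake(self, terminal_id):
        # Placeholder for the slow handshake; a real one never
        # returns the key in the clear like this.
        key = secrets.token_bytes(32)
        self.states[terminal_id] = key
        return key

    def receive(self, terminal_id, payload, tag):
        """Accept the message only if it matches the stored state."""
        key = self.states.get(terminal_id)
        if key is None:
            return False  # no stored state: close the connection
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)

    def clear_all(self):
        """Nightly reset, forcing one fresh handshake per terminal."""
        self.states.clear()


class Terminal:
    def __init__(self, terminal_id, server):
        self.terminal_id = terminal_id
        self.server = server
        self.key = None      # saved symmetric state, if any
        self.handshakes = 0  # number of full handshakes performed

    def _tag(self, payload):
        return hmac.new(self.key, payload, hashlib.sha256).digest()

    def send(self, payload):
        # Speculatively use the stored key without any handshake.
        if self.key is not None:
            tag = self._tag(payload)
            if self.server.receive(self.terminal_id, payload, tag):
                return "resumed"
        # The server rejected us (or we have no state): start over.
        self.key = self.server.full_handshake(self.terminal_id)
        self.handshakes += 1
        if not self.server.receive(self.terminal_id, payload, self._tag(payload)):
            raise RuntimeError("fresh handshake was rejected")
        return "handshake"
```

| In this sketch only the first message and the first one after
| clear_all pay for a handshake; everything in between resumes
| from the stored state.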
| twic wrote:
| Did something like and unlike this a while ago. Payment
| processing system, multiple microservices written in Java on a
| cloud platform. Tool runs on a developer machine, uses the cloud
| tools to SSH into a running container and run jdb. It can add a
| breakpoint, wait for it to get hit, then do a selected thing -
| resume immediately, delay a while then resume, throw an
| exception, hang forever, etc.
|
| The main tool can also manage the platform, so start the app,
| kill individual containers, etc. And inject payment messages, and
| wait for payments to be sent out.
|
| So you could write test plans like "add a breakpoint in
| ValidateAccount::isValidSortCode, inject a payment from customer
| A, wait for the breakpoint to get hit, inject a payment from
| customer B, throw a NullPointerException, then check that the
| payment from customer B gets delivered".
|
| We didn't do systematic testing of failure at every point, as in
| the article, but it was very nice to be able to automate testing
| of error cases without having to have special code in the app for
| it.
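|
| A single-process analogue of such a test plan, as a sketch only.
| The original tool drove jdb against running containers; here
| unittest.mock wraps a method from outside, so the application
| class still carries no fault-injection code. PaymentService, its
| methods and run_plan are invented for illustration.

```python
from unittest import mock


class PaymentService:
    """Toy application code with no test hooks of its own."""

    def __init__(self):
        self.delivered = []
        self.failed = []

    def is_valid_sort_code(self, sort_code):
        return len(sort_code) == 6 and sort_code.isdigit()

    def process(self, customer, sort_code):
        try:
            if self.is_valid_sort_code(sort_code):
                self.delivered.append(customer)
            else:
                self.failed.append(customer)
        except Exception:
            # One broken payment must not take the service down.
            self.failed.append(customer)


def run_plan():
    """Fail the first validation call, then check the next payment."""
    service = PaymentService()
    original = service.is_valid_sort_code  # bound method, saved before patching
    calls = []

    def breakpoint_action(sort_code):
        calls.append(sort_code)
        if len(calls) == 1:
            raise RuntimeError("injected fault at the breakpoint")
        return original(sort_code)  # later calls resume normally

    with mock.patch.object(
        service, "is_valid_sort_code", side_effect=breakpoint_action
    ):
        service.process("A", "123456")  # hits the "breakpoint" and fails
        service.process("B", "654321")  # should still be delivered
    return service
```

| Here run_plan returns a service whose failed list holds customer
| A and whose delivered list holds customer B. Delays or hangs
| could be injected the same way by changing breakpoint_action.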
| jdblair wrote:
| A question about how NonStop worked:
|
| If both processors are running the same process in the same
| state, why won't both processors hit the same error condition at
| the same time?
|
| I understand there are random hardware faults that can happen,
| bits can flip, etc., but logic errors should be bug-for-bug the
| same on both processors.
|
| So, were those random faults so frequent that the redundancy was
| worth it? Or am I missing something?
___________________________________________________________________
(page generated 2024-03-29 23:00 UTC)