[HN Gopher] Bresenham's Circle Drawing Algorithm (2021)
       ___________________________________________________________________
        
       Bresenham's Circle Drawing Algorithm (2021)
        
       Author : RojerGS
       Score  : 85 points
        Date   : 2024-08-30 22:53 UTC (1 day ago)
        
 (HTM) web link (funloop.org)
 (TXT) w3m dump (funloop.org)
        
       | KingOfCoders wrote:
       | What I found most amazing about Bresenham during my Amiga days,
       | is how you can use it to zoom and shrink bitmaps.
        
       | corysama wrote:
       | Do note that Bresenham's family of algorithms (and much of the
       | venerable Computer Graphics Principles and Practice) are from a
       | bygone era where computers executed 1 instruction per cycle
       | without pipelining or prediction.
       | 
       | These days processors prefer to draw shapes by having a coarse
       | grained pass that conservatively selects tiles of interest then
       | brute-force evaluates each pixel in each tile independently of
       | all others.
       | 
       | Instead of minimizing total work, the goal is to maximize
       | pipelined parallelism while skipping over unnecessary work in
       | large blocks.
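To make the coarse-pass idea concrete, here is a toy sketch in Python (my own illustration, not from the comment; `TILE`, `tile_may_touch`, and `draw_circle_tiled` are invented names, and real hardware would run the per-pixel pass in parallel rather than in nested loops):

```python
import math

TILE = 8  # illustrative tile size

def tile_may_touch(tx, ty, cx, cy, r, half):
    # Conservative tile test: nearest and farthest pixel of the tile
    # from the circle center; the tile survives only if the band
    # [r - half, r + half] can intersect that distance range.
    nx = min(max(cx, tx), tx + TILE - 1)
    ny = min(max(cy, ty), ty + TILE - 1)
    fx = tx if abs(tx - cx) >= abs(tx + TILE - 1 - cx) else tx + TILE - 1
    fy = ty if abs(ty - cy) >= abs(ty + TILE - 1 - cy) else ty + TILE - 1
    dmin = math.hypot(nx - cx, ny - cy)
    dmax = math.hypot(fx - cx, fy - cy)
    return dmin <= r + half and dmax >= r - half

def draw_circle_tiled(cx, cy, r, w, h, half=0.5):
    lit = set()
    for ty in range(0, h, TILE):
        for tx in range(0, w, TILE):
            if not tile_may_touch(tx, ty, cx, cy, r, half):
                continue                      # skip a whole tile at once
            for y in range(ty, min(ty + TILE, h)):
                for x in range(tx, min(tx + TILE, w)):
                    # per-pixel coverage test, independent of all others
                    if abs(math.hypot(x - cx, y - cy) - r) <= half:
                        lit.add((x, y))
    return lit
```

The tile test is conservative: it may keep an empty tile, but it never rejects a tile containing a lit pixel, which is what lets the fine pass evaluate every pixel independently.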
        
         | eru wrote:
         | It's also from an era when floats were rather expensive.
        
           | mabster wrote:
           | I still picture them as expensive. Things like trig functions
           | are still very expensive.
        
             | eru wrote:
             | Yes, but here it's about avoiding multiplication (and
             | division).
             | 
             | I suspect on a modern processor the branches (ie "if"s) in
             | Bresenham's algorithm are gonna be more expensive than the
             | multiplications and divisions in the naive algorithm.
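For comparison, a minimal sketch of the "naive" per-column approach being alluded to (my own hedged illustration, not from the thread): one multiply, one subtraction, one square root per pixel, and no data-dependent branch in the loop body.

```python
import math

def naive_circle_octant(r):
    # Solve y = sqrt(r^2 - x^2) for each column of the second octant
    # (0 <= x <= y) and round to the nearest pixel row.
    return [(x, round(math.sqrt(r * r - x * x)))
            for x in range(int(r / math.sqrt(2)) + 1)]
```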
        
               | ack_complete wrote:
               | Bresenham is easy to implement branchlessly using
               | conditional moves or arithmetic. It also produces a
               | regular pattern of branches that is favorable for modern
               | branch predictors that can do pattern prediction.
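A hedged sketch of the conditional-arithmetic version (my own, in Python to match the thread's other snippet; in C, the `step` assignment is exactly the kind of compare that compiles to a conditional move):

```python
def midpoint_circle_octant(r):
    # Second-octant midpoint/Bresenham walk with the branch folded
    # into arithmetic: step is 1 when the decision variable says
    # "also move down a row", else 0, so the loop body contains no
    # data-dependent branch.
    pts = []
    x, y, d = 0, r, 1 - r            # d: midpoint decision variable
    while x <= y:
        pts.append((x, y))
        step = int(d >= 0)           # in C: int step = (d >= 0);
        d += 2 * x + 3 + step * (2 - 2 * y)
        y -= step
        x += 1
    return pts
```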
        
             | dahart wrote:
             | I learned relatively recently that trig functions on the
             | GPU are free if you don't use too many of them; there's a
             | separate hardware pipe so they can execute in parallel with
             | floats adds and muls. There's still extra latency, but
             | it'll hide if there's enough other stuff in the vicinity.
        
               | sischoel wrote:
               | The CUDA documenation tells me that there are more
               | performant but less precise trigonometric functions:
                | https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....
               | 
               | Do you know if that hardware pipeline works only for
               | these intrinsic variants?
        
               | dahart wrote:
               | Yep, these intrinsics are what I was referring to, and
               | yes the software versions won't use the hardware trig
               | unit, they'll be written using an approximating spline
               | and/or Newton's method, I would assume, probably mostly
               | using adds and multiplies. Note the loss of precision
               | with these fast-math intrinsics isn't very much, it's
               | usually like 1 or 2 bits at most.
        
         | mysterydip wrote:
         | "yes, but" there's still places that can take advantage of such
         | algorithms today, namely microcontrollers. I think some
         | scripting languages may even apply here, although many
         | interpreters do some level of compilation/optimization instead
         | of serial execution.
        
         | Someone wrote:
         | > a bygone era where computers executed 1 instruction per cycle
         | without pipelining or prediction
         | 
         | 1 instruction per cycle? What luxury bygone era did you grow up
         | in?
         | 
          | Wikipedia tells me the algorithm is from 1962 on an IBM 1401
          | (https://en.wikipedia.org/wiki/Bresenham's_line_algorithm#His...).
         | 
         | That definitely didn't have many single cycle instructions.
          | Skimming https://ibm-1401.info/A24-6447-0_1401_1460_Instruction_and_T...,
          | I couldn't find any.
         | 
          | Certainly, in the era of 8-bit CPUs like the Z80 and 6502,
          | programmers would have waxed lyrical about "1 instruction per
          | cycle".
         | 
         | Actually, did any CPU ever "execute 1 instruction per cycle
         | without pipelining or prediction" (or, slightly looser "had
         | fixed time instructions without pipelining or prediction")?
         | 
         | RISC introduced fixed time instructions, but also pipelining.
        
           | owisd wrote:
           | Probably meant 'cycle' in the sense of instruction cycle,
           | rather than clock cycle.
        
             | Someone wrote:
             | Even then, few CPUs execute all instructions in the same
             | amount of time. Division, for example, still takes longer
             | than a register move on most architectures. Multiplication,
             | historically, did too.
             | 
             | As a sibling comment points out, DSPs can be an exception.
             | I'm far from an expert on them, but CPUs that try to run
             | all instructions in the same fixed amount of time used to
             | accomplish that by avoiding complex addressing modes and by
             | omitting the really slow instructions (division and
             | instructions that push/pop multiple registers onto/from the
             | stack being prime examples).
             | 
              | IIRC, some of them also accomplished it by running some of
              | the simpler instructions slower than necessary (raw speed
              | isn't as important as being hard real time in many
              | applications).
        
               | kragen wrote:
               | > _few CPUs execute all instructions in the same amount
               | of time. Division, for example, still takes longer than a
               | register move on most architectures_
               | 
               | easy solution, as you allude to, don't have a division
               | instruction! arm doesn't, 8048 doesn't, sparcv7 doesn't,
               | even the cdc 6600 and cray-1 didn't, and even risc-v
               | finally got zmmul:
               | https://wiki.riscv.org/display/HOME/Zmmul. it's not just
               | dsps
               | 
               | the big issue with complex addressing modes is i think
               | fault handling. if your nice orthogonal _add_ instruction
               | updates two registers as it reads operands and a third
               | register with the sum of the operands, what do you do if
               | you get a page fault on the second operand? if the os can
               | service the page fault, how does it restart the
               | instruction?
               | 
               | as you point out, real-time latency is also an issue. on
               | older arm chips you have to be careful not to ldm or stm
               | too many registers in a single instruction so as not to
               | damage interrupt latency. newer arm chips can restart the
               | ldm/stm
               | 
               | i'm no expert on the area either
        
           | chillingeffect wrote:
           | Adsp-218x family of harvard architecture DSPs. Each 25 ns
           | clock cycle = 1 MAC, 1 data fetch, and one instruction fetch.
           | All instructions 1 cycle long. And many other gizmos like
           | reversible address bit ranges for DFT in place, separate
           | register bank for IRQ handlers. And all laid out by hand.
        
           | buescher wrote:
           | I remember the bragging point on the RTX2000 in the very late
           | eighties was "a MIP per megahertz".
        
             | kragen wrote:
             | my limited experience with stack machines is that a stack
             | mip is about half a register-machine mip :-(
             | 
             | consider this simple assembly-language subroutine i wrote
             | in october (http://canonical.org/~kragen/sw/dev3/tetris.S,
             | screencast at https://asciinema.org/a/622461):
              | @@ Set sleep for iwait to r0 milliseconds.
              | @@ (r0 must be under 1000)
              |         .thumb_func
              | waitis: ldr r2, =wait           @ struct timeval
              |         movs r3, #0             @ 0 sec
              |         str r3, [r2]            @ .tv_sec = 0
              |         ldr r1, =1000           @ multiplier for ms
              |         mul r0, r1
              |         str r0, [r2, #4]        @ set .tv_usec
              |         bx lr
              |         .bss
              | wait:   .fill 8                 @ the struct timeval
              |         .text
             | 
             | these are all perfectly normal register-machine
             | instructions; you could translate them one-to-one to almost
             | any register machine. on a few of them you could drop one,
             | writing something like str #0, [wait]. the whole function
             | is straight-line execution, a single basic block, and seven
             | instructions long. this is almost a best case for a stack
              | machine; it looks like this:
              |         lit(0)      ; load immediate constant
              |         lea(wait)   ; load address of wait onto stack
              |         !           ; store 0 in wait
              |         lit(1000)   ; another constant
              |         *           ; multiply argument by constant
              |         lea(wait)   ; load address again
              |         lit(4)      ; offset of .tv_usec
              |         +           ; calculate address of wait.tv_usec
              |         !           ; store product in tv_usec
              |         ret         ; return from subroutine
             | 
             | that's 10 instructions, about 50% longer than the 6 or 7 of
             | the two-operand register machine. but the typical case is
             | worse. and basically the reason is that the average number
             | of stack manipulations or constant pushes that you need to
             | do to get the right operands on top of the stack for your
             | memory accesses and computational operations is roughly 1.
             | sometimes you'll have a * or ! (store) or + that's not
             | preceded by a stack manipulation or a load-immediate or a
             | load-address operation, and that time the stack machine
             | wins, but other times it'll be preceded by two or three of
             | them, and that time the stack machine loses
             | 
             | the rtx-2000 did some tricks to sometimes do more than a
             | single stack operation per cycle, but it didn't really
             | escape from this
             | 
             | i don't remember if dr. ting wrote a book about the
             | rtx-2000, but he did write a very nice book about its
             | predecessor the nc4000, going into some of these tricks:
             | https://www.forth.org/OffeteStore/4001-footstepsFinal.pdf
        
           | kragen wrote:
           | no, you're right, you almost always need pipelining to get
           | one instruction per clock cycle
           | 
           | but there are a _lot_ of cpus out there--maybe the majority
           | now--that are pipelined microarchitectures that get one
           | instruction per cycle without much or any prediction. avrs,
           | most of the cortex-m* (all?), most modern implementations of
           | old slow isas like the z80 and 8051, etc. big processors like
           | your laptop cpu and cellphone cpu are of course superscalar,
           | but they are a tiny minority of all processors. even inside
            | the cellphone case, they're outnumbered by in-order scalar
           | microcontrollers
           | 
           | without prediction, of course, you have a pipeline bubble
           | every time you have a branch, so you never quite hit 1 ipc
           | with scalar execution. but it's usually pretty close, and
           | even with prediction, sometimes you miss. and usually if you
           | have branch prediction, you also have a cache, because ain't
           | nobody got time to share one triflin memory bus between
           | instructions and data
           | 
           | so pipelining gets you from, say, 3 clocks per instruction
           | down to 1.2 or so. then prediction gets you from 1.2 down to,
           | say, 1.02
        
         | _0ffh wrote:
          | Nah, while cycles/instruction were indeed fixed in those days
          | (and for some time yet to come), it was not necessarily 1 cycle
          | but rather depended on the instruction.
        
           | tengwar2 wrote:
           | Indeed. I used this algorithm on OS/2 1.0 as part of a GUI
           | (OS/2 did not have a GUI until 1.1). That was on an 80386.
           | MASM came with a nice ring-bound A6 book which summarised the
           | instruction set, including timings. I seem to remember that 3
           | cycles was normal for a short instruction, but many were
           | considerably longer.
        
         | amelius wrote:
         | Sounds like your algorithm is from a bygone era where power was
         | not important ;)
        
           | corysama wrote:
           | If you can get the work done fast and hurry the processor
           | back into a low-power sleep state ASAP, it can actually be
           | power efficient too!
        
         | phire wrote:
         | 1 instruction per cycle? No, that's only possible with
         | pipelining.
         | 
         | We are talking about instructions which took 8-12 cycles to
         | complete.
        
           | kragen wrote:
           | usually this is correct, but there are some exceptions. most
           | instructions on a non-pipelined but synchronous stack machine
           | like the mup21 take a single cycle, for example
           | 
           | even with a register file, it isn't really inherent that you
           | need to decode inputs, do alu operations, and write outputs
           | in separate clock cycles; you can do all of that in
           | combinational logic except writing the outputs, and you can
           | even decode which register to write the output to. it just
           | means your max clock rate is in the toilet
           | 
           | for that kind of thing a harvard architecture is pretty
           | useful; it allows you to read an instruction in instruction
           | memory at the same time you're reading or writing data in
           | data memory, instead of in two separate cycles
        
         | chiph wrote:
         | I think it depends. I had Dr. Bresenham as my graphics
         | instructor (he taught for a while after he retired from IBM)
          | and the class used Borland Turbo Pascal, which for its time was
          | fast. Not as fast as raw assembly, but faster than Borland's
          | Turbo C that had just come out.
         | 
         | So far as 1 instruction per cycle - Wikipedia says the 80286
         | (the top dog PC processor of the time) could execute 0.21
         | "typical" instructions per clock. Optimized code was up to 0.5
         | instructions per clock. And that agrees with my memories.
         | 
         | Today, I would try and use parallelism if I could. With lots of
         | conditions though - is the process being applied to the image
         | trivially parallelizable, will most/all of it fit in cache,
          | etc. Trying to parallelize Bresenham's algorithm itself,
          | though, would be futile: when drawing circles you can reflect
          | it into the different quadrants (big savings), but the inner
          | loop is going to be pretty serial because it has to keep track
          | of the error term as it draws.
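The reflection into quadrants is the part that parallelizes trivially; a hedged sketch (my own naming, not from the comment) of the 8-way symmetry:

```python
def reflect_octant(octant_points):
    # Expand second-octant points (0 <= x <= y) into all eight octants
    # by swapping coordinates and flipping signs.  Each reflection is
    # independent, so this step parallelizes trivially even though the
    # octant walk itself is serial.
    full = set()
    for x, y in octant_points:
        for px, py in ((x, y), (y, x)):
            for sx in (1, -1):
                for sy in (1, -1):
                    full.add((sx * px, sy * py))
    return full
```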
        
         | brcmthrowaway wrote:
          | Got an example?
        
       | bsenftner wrote:
       | My computer graphics professor back in the early 80's was Prof.
       | Glen Bresenham, but not The Bresenham. It was a lot of fun being
       | at SIGGRAPH back then and watching people freak upon reading his
       | name badge. He'd let them believe for a bit, and then explain
        | it's not him. Al Alcorn was one that freaked, and that was fun.
        
       | possiblywrong wrote:
       | > Note that if F(x,y)=0, then the point (x,y) is exactly on the
       | circle. If F(x,y)>0, then the point is outside of the circle, and
       | if F(x,y)<0 then the point is inside of it. _In other words,
       | given any point (x,y), F(x,y) is the distance from the true
       | circle line_ [my emphasis].
       | 
        | This last is not quite true. The exact distance from the circle,
        | call it G(x,y), is the corresponding difference of square roots,
        | i.e.,
        | 
        |     import math
        | 
        |     def G(x, y, r):
        |         return math.sqrt(x * x + y * y) - math.sqrt(r * r)
       | 
        | and G(x,y) isn't just the square root of F(x,y); in particular,
        | |G(x,y)| doesn't behave monotonically with respect to |F(x,y)|,
        | so a smaller |F| need not mean closer to the circle.
       | 
        | It's an interesting property of Bresenham's algorithm, one I've
        | never seen even stated, let alone proved, in the literature, that
        | this doesn't matter: the algorithm is indeed exact in the sense
        | that it always chooses the next point based on which is _truly_
        | closest to the circle... despite using an error function that is
        | only an approximation.
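The claim can at least be spot-checked by brute force. Below is my own sketch (the names and the second-octant stepping are my choices, not from the comment): at each step, compare the candidate chosen by the smaller |F| against the candidate truly closer to the circle by |G|.

```python
import math

def F(x, y, r):
    # Bresenham's integer error function
    return x * x + y * y - r * r

def G(x, y, r):
    # true signed distance from (x, y) to the circle of radius r
    return math.hypot(x, y) - r

def decisions_agree(r):
    # Walk the second octant; at each step the two candidate pixels
    # are (x+1, y) and (x+1, y-1).  Verify that picking the smaller
    # |F| always matches picking the smaller true distance |G|.
    x, y = 0, r
    while x < y:
        a, b = (x + 1, y), (x + 1, y - 1)
        by_F = abs(F(*a, r)) <= abs(F(*b, r))
        by_G = abs(G(*a, r)) <= abs(G(*b, r))
        if by_F != by_G:
            return False          # found a counterexample
        x += 1
        if not by_F:              # diagonal step was closer
            y -= 1
    return True

# agreement for every radius up to 300 (a check, not a proof)
ok = all(decisions_agree(r) for r in range(1, 300))
```

If `ok` comes back True across a range of radii, that is consistent with the exactness claim; an actual proof would need the argument the comment says is missing from the literature.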
        
       | ericyd wrote:
       | A cool write up with an approachable explanation of algorithm
       | optimization! I wonder how long it would take me to arrive at
       | this algorithm left to my own devices...
        
       ___________________________________________________________________
       (page generated 2024-08-31 23:01 UTC)