[HN Gopher] Bresenham's Circle Drawing Algorithm (2021)
___________________________________________________________________
Bresenham's Circle Drawing Algorithm (2021)
Author : RojerGS
Score : 85 points
Date : 2024-08-30 22:53 UTC (1 day ago)
(HTM) web link (funloop.org)
(TXT) w3m dump (funloop.org)
| KingOfCoders wrote:
| What I found most amazing about Bresenham during my Amiga days
| is how you can use it to zoom and shrink bitmaps.
| corysama wrote:
| Do note that Bresenham's family of algorithms (and much of the
| venerable Computer Graphics Principles and Practice) are from a
| bygone era where computers executed 1 instruction per cycle
| without pipelining or prediction.
|
| These days processors prefer to draw shapes by having a coarse
| grained pass that conservatively selects tiles of interest then
| brute-force evaluates each pixel in each tile independently of
| all others.
|
| Instead of minimizing total work, the goal is to maximize
| pipelined parallelism while skipping over unnecessary work in
| large blocks.
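| Sketching that two-pass idea in Python (the 8-pixel tile and
| every name here are illustrative, not any particular GPU's
| scheme):

```python
TILE = 8  # illustrative tile size

def inside(px, py, cx, cy, r):
    # Per-pixel coverage test, evaluated independently for every pixel.
    dx, dy = px - cx, py - cy
    return dx * dx + dy * dy <= r * r

def fill_circle(cx, cy, r, w, h):
    pixels = []
    for ty in range(0, h, TILE):
        for tx in range(0, w, TILE):
            # Coarse pass: reject a tile only when the tile point nearest
            # to the center is already outside the circle (conservative).
            nx = min(max(cx, tx), tx + TILE - 1)
            ny = min(max(cy, ty), ty + TILE - 1)
            if not inside(nx, ny, cx, cy, r):
                continue
            # Fine pass: brute-force every pixel in the surviving tile.
            for py in range(ty, min(ty + TILE, h)):
                for px in range(tx, min(tx + TILE, w)):
                    if inside(px, py, cx, cy, r):
                        pixels.append((px, py))
    return pixels
```

| Whole tiles outside the circle are skipped in one test, and the
| surviving pixels are all independent of each other, so they
| pipeline well.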
| eru wrote:
| It's also from an era when floats were rather expensive.
| mabster wrote:
| I still picture them as expensive. Things like trig functions
| are still very expensive.
| eru wrote:
| Yes, but here it's about avoiding multiplication (and
| division).
|
| I suspect on a modern processor the branches (ie "if"s) in
| Bresenham's algorithm are gonna be more expensive than the
| multiplications and divisions in the naive algorithm.
| ack_complete wrote:
| Bresenham is easy to implement branchlessly using
| conditional moves or arithmetic. It also produces a
| regular pattern of branches that is favorable for modern
| branch predictors that can do pattern prediction.
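| As a sketch of the branchless idea (my own Python illustration;
| the sign test would be a conditional move or mask on real
| hardware, and the function name is made up):

```python
def circle_octant_branchless(r):
    # Walk one octant of a midpoint-style circle of radius r.
    # The data-dependent "if" on the decision variable d is folded
    # into arithmetic, mimicking a conditional move.
    pts = []
    x, y, d = r, 0, 1 - r
    while x >= y:
        pts.append((x, y))
        y += 1
        step = int(d >= 0)           # 0 or 1; a sign-bit extract in hardware
        x -= step                    # step x inward only when d >= 0
        d += 2 * (y - step * x) + 1  # same update as the branching version
    return pts
```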
| dahart wrote:
| I learned relatively recently that trig functions on the
| GPU are free if you don't use too many of them; there's a
| separate hardware pipe so they can execute in parallel with
| float adds and muls. There's still extra latency, but
| it'll hide if there's enough other stuff in the vicinity.
| sischoel wrote:
| The CUDA documentation tells me that there are more
| performant but less precise trigonometric functions:
| https://docs.nvidia.com/cuda/cuda-c-programming-
| guide/index....
|
| Do you know if that hardware pipeline works only for
| these intrinsic variants?
| dahart wrote:
| Yep, these intrinsics are what I was referring to, and
| yes the software versions won't use the hardware trig
| unit, they'll be written using an approximating spline
| and/or Newton's method, I would assume, probably mostly
| using adds and multiplies. Note the loss of precision
| with these fast-math intrinsics isn't much; it's usually
| only 1 or 2 bits at most.
| mysterydip wrote:
| "yes, but" there's still places that can take advantage of such
| algorithms today, namely microcontrollers. I think some
| scripting languages may even apply here, although many
| interpreters do some level of compilation/optimization instead
| of serial execution.
| Someone wrote:
| > a bygone era where computers executed 1 instruction per cycle
| without pipelining or prediction
|
| 1 instruction per cycle? What luxury bygone era did you grow up
| in?
|
| Wikipedia tells me the algorithm is from 1962 on an IBM 1401
| (https://en.wikipedia.org/wiki/Bresenham's_line_algorithm#His...
|
| That definitely didn't have many single cycle instructions.
| Skimming https://ibm-1401.info/A24-6447-0_1401_1460_Instruction_and_T...,
| I couldn't find any.
|
| Certainly, in the era of 8-bit CPUs like the Z80 and 6502,
| programmers would have waxed lyrical about "1 instruction per
| cycle".
|
| Actually, did any CPU ever "execute 1 instruction per cycle
| without pipelining or prediction" (or, slightly looser "had
| fixed time instructions without pipelining or prediction")?
|
| RISC introduced fixed time instructions, but also pipelining.
| owisd wrote:
| Probably meant 'cycle' in the sense of instruction cycle,
| rather than clock cycle.
| Someone wrote:
| Even then, few CPUs execute all instructions in the same
| amount of time. Division, for example, still takes longer
| than a register move on most architectures. Multiplication,
| historically, did too.
|
| As a sibling comment points out, DSPs can be an exception.
| I'm far from an expert on them, but CPUs that try to run
| all instructions in the same fixed amount of time used to
| accomplish that by avoiding complex addressing modes and by
| omitting the really slow instructions (division and
| instructions that push/pop multiple registers onto/from the
| stack being prime examples).
|
| IIRC, some of them also accomplished it by running some of
| the simpler instructions slower than necessary (raw speed
| isn't as important as being hard real time in many
| applications).
| kragen wrote:
| > _few CPUs execute all instructions in the same amount
| of time. Division, for example, still takes longer than a
| register move on most architectures_
|
| easy solution, as you allude to, don't have a division
| instruction! arm doesn't, 8048 doesn't, sparcv7 doesn't,
| even the cdc 6600 and cray-1 didn't, and even risc-v
| finally got zmmul:
| https://wiki.riscv.org/display/HOME/Zmmul. it's not just
| dsps
|
| the big issue with complex addressing modes is i think
| fault handling. if your nice orthogonal _add_ instruction
| updates two registers as it reads operands and a third
| register with the sum of the operands, what do you do if
| you get a page fault on the second operand? if the os can
| service the page fault, how does it restart the
| instruction?
|
| as you point out, real-time latency is also an issue. on
| older arm chips you have to be careful not to ldm or stm
| too many registers in a single instruction so as not to
| damage interrupt latency. newer arm chips can restart the
| ldm/stm
|
| i'm no expert on the area either
| chillingeffect wrote:
| Adsp-218x family of harvard architecture DSPs. Each 25 ns
| clock cycle = 1 MAC, 1 data fetch, and 1 instruction fetch.
| All instructions 1 cycle long. And many other gizmos like
| reversible address bit ranges for DFT in place, separate
| register bank for IRQ handlers. And all laid out by hand.
| buescher wrote:
| I remember the bragging point on the RTX2000 in the very late
| eighties was "a MIP per megahertz".
| kragen wrote:
| my limited experience with stack machines is that a stack
| mip is about half a register-machine mip :-(
|
| consider this simple assembly-language subroutine i wrote
| in october (http://canonical.org/~kragen/sw/dev3/tetris.S,
| screencast at https://asciinema.org/a/622461):
|     @@ Set sleep for iwait to r0 milliseconds.
|     @@ (r0 must be under 1000)
|             .thumb_func
|     waitis: ldr  r2, =wait      @ struct timeval
|             movs r3, #0         @ 0 sec
|             str  r3, [r2]       @ .tv_sec = 0
|             ldr  r1, =1000      @ multiplier for ms
|             mul  r0, r1
|             str  r0, [r2, #4]   @ set .tv_usec
|             bx   lr
|             .bss
|     wait:   .fill 8             @ the struct timeval
|             .text
|
| these are all perfectly normal register-machine
| instructions; you could translate them one-to-one to almost
| any register machine. on a few of them you could drop one,
| writing something like str #0, [wait]. the whole function
| is straight-line execution, a single basic block, and seven
| instructions long. this is almost a best case for a stack
| machine; it looks like this:
|     lit(0)     ; load immediate constant
|     lea(wait)  ; load address of wait onto stack
|     !          ; store 0 in wait
|     lit(1000)  ; another constant
|     *          ; multiply argument by constant
|     lea(wait)  ; load address again
|     lit(4)     ; offset of .tv_usec
|     +          ; calculate address of wait.tv_usec
|     !          ; store product in tv_usec
|     ret        ; return from subroutine
|
| that's 10 instructions, about 50% longer than the 6 or 7 of
| the two-operand register machine. but the typical case is
| worse. and basically the reason is that the average number
| of stack manipulations or constant pushes that you need to
| do to get the right operands on top of the stack for your
| memory accesses and computational operations is roughly 1.
| sometimes you'll have a * or ! (store) or + that's not
| preceded by a stack manipulation or a load-immediate or a
| load-address operation, and that time the stack machine
| wins, but other times it'll be preceded by two or three of
| them, and that time the stack machine loses
|
| the rtx-2000 did some tricks to sometimes do more than a
| single stack operation per cycle, but it didn't really
| escape from this
|
| i don't remember if dr. ting wrote a book about the
| rtx-2000, but he did write a very nice book about its
| predecessor the nc4000, going into some of these tricks:
| https://www.forth.org/OffeteStore/4001-footstepsFinal.pdf
| kragen wrote:
| no, you're right, you almost always need pipelining to get
| one instruction per clock cycle
|
| but there are a _lot_ of cpus out there--maybe the majority
| now--that are pipelined microarchitectures that get one
| instruction per cycle without much or any prediction. avrs,
| most of the cortex-m* (all?), most modern implementations of
| old slow isas like the z80 and 8051, etc. big processors like
| your laptop cpu and cellphone cpu are of course superscalar,
| but they are a tiny minority of all processors. even inside
| the cellphone case, they're outnumbered by in-order scalar
| microcontrollers
|
| without prediction, of course, you have a pipeline bubble
| every time you have a branch, so you never quite hit 1 ipc
| with scalar execution. but it's usually pretty close, and
| even with prediction, sometimes you miss. and usually if you
| have branch prediction, you also have a cache, because ain't
| nobody got time to share one triflin memory bus between
| instructions and data
|
| so pipelining gets you from, say, 3 clocks per instruction
| down to 1.2 or so. then prediction gets you from 1.2 down to,
| say, 1.02
| _0ffh wrote:
| Nah, while cycles/instruction were indeed fixed in those days
| (and for some time yet to come), it was not necessarily 1 cycle
| but rather depended on the instruction.
| tengwar2 wrote:
| Indeed. I used this algorithm on OS/2 1.0 as part of a GUI
| (OS/2 did not have a GUI until 1.1). That was on an 80386.
| MASM came with a nice ring-bound A6 book which summarised the
| instruction set, including timings. I seem to remember that 3
| cycles was normal for a short instruction, but many were
| considerably longer.
| amelius wrote:
| Sounds like your algorithm is from a bygone era where power was
| not important ;)
| corysama wrote:
| If you can get the work done fast and hurry the processor
| back into a low-power sleep state ASAP, it can actually be
| power efficient too!
| phire wrote:
| 1 instruction per cycle? No, that's only possible with
| pipelining.
|
| We are talking about instructions which took 8-12 cycles to
| complete.
| kragen wrote:
| usually this is correct, but there are some exceptions. most
| instructions on a non-pipelined but synchronous stack machine
| like the mup21 take a single cycle, for example
|
| even with a register file, it isn't really inherent that you
| need to decode inputs, do alu operations, and write outputs
| in separate clock cycles; you can do all of that in
| combinational logic except writing the outputs, and you can
| even decode which register to write the output to. it just
| means your max clock rate is in the toilet
|
| for that kind of thing a harvard architecture is pretty
| useful; it allows you to read an instruction in instruction
| memory at the same time you're reading or writing data in
| data memory, instead of in two separate cycles
| chiph wrote:
| I think it depends. I had Dr. Bresenham as my graphics
| instructor (he taught for a while after he retired from IBM)
| and the class used Borland Turbo Pascal, which for its time
| was fast. Not as fast as raw assembly, but faster than
| Borland's Turbo C, which had just come out.
|
| So far as 1 instruction per cycle - Wikipedia says the 80286
| (the top dog PC processor of the time) could execute 0.21
| "typical" instructions per clock. Optimized code was up to 0.5
| instructions per clock. And that agrees with my memories.
|
| Today, I would try and use parallelism if I could. With lots of
| conditions though - is the process being applied to the image
| trivially parallelizable, will most/all of it fit in cache,
| etc. Trying to parallelize Bresenham's algorithms, though,
| would be futile - when drawing circles you can reflect the
| curve into the different quadrants (big savings), but the
| algorithm itself is going to be pretty serial because it has
| to keep track of the error term as it draws.
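| To illustrate the reflection savings mentioned above, here's a
| minimal Python sketch of the midpoint/Bresenham idea (not the
| article's exact code): only the one-octant walk with its error
| term is serial; the other seven octants come from mirroring
| each point.

```python
def circle_points(r):
    # Walk one octant; the error term d is updated serially, but
    # each computed point is reflected into all eight octants.
    pts = set()
    x, y, d = r, 0, 1 - r
    while x >= y:
        for px, py in ((x, y), (y, x), (-y, x), (-x, y),
                       (-x, -y), (-y, -x), (y, -x), (x, -y)):
            pts.add((px, py))
        y += 1
        if d < 0:
            d += 2 * y + 1           # midpoint inside: keep x
        else:
            x -= 1
            d += 2 * (y - x) + 1     # midpoint outside: step x inward
    return pts
```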
| brcmthrowaway wrote:
| Got an example?
| bsenftner wrote:
| My computer graphics professor back in the early 80's was Prof.
| Glen Bresenham, but not The Bresenham. It was a lot of fun being
| at SIGGRAPH back then and watching people freak upon reading his
| name badge. He'd let them believe for a bit, and then explain
| it's not him. Al Alcorn was one that freaked, and that was fun.
| possiblywrong wrote:
| > Note that if F(x,y)=0, then the point (x,y) is exactly on the
| circle. If F(x,y)>0, then the point is outside of the circle, and
| if F(x,y)<0 then the point is inside of it. _In other words,
| given any point (x,y), F(x,y) is the distance from the true
| circle line_ [my emphasis].
|
| This last is not quite true. The exact distance from the circle,
| call it G(x,y), is the corresponding difference of square roots,
| i.e.,
|
|     def G(x, y, r):
|         return math.sqrt(x * x + y * y) - math.sqrt(r * r)
|
| and G(x,y) isn't just the square root of F(x,y); in
| particular, |G(x,y)| doesn't always rank a pair of points the
| same way |F(x,y)| does.
|
| It's an interesting property of Bresenham's algorithm, that I've
| never seen even stated let alone proved in the literature, that
| this doesn't matter, and the algorithm is indeed exact in the
| sense that it always chooses the next point based on which is
| _truly_ closest to the circle... despite using an error function
| that is only an approximation.
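| Here is a concrete pair of points (r and both points are my own
| hand-picked example) where ranking by |F| disagrees with the
| true distance, which makes it all the more striking that this
| never happens for the algorithm's constrained candidates:

```python
import math

def F(x, y, r):
    # Implicit circle function the algorithm uses as its error measure.
    return x * x + y * y - r * r

def G(x, y, r):
    # True signed Euclidean distance from the circle of radius r.
    return math.sqrt(x * x + y * y) - r

r = 10
a = (6, 7)   # inside the circle:  true distance ~0.780
b = (4, 10)  # outside the circle: true distance ~0.770
print(abs(F(*a, r)), abs(F(*b, r)))   # 15 16 -> |F| says a is closer
print(abs(G(*a, r)) > abs(G(*b, r)))  # True  -> but b is truly closer
```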
| ericyd wrote:
| A cool write up with an approachable explanation of algorithm
| optimization! I wonder how long it would take me to arrive at
| this algorithm left to my own devices...
___________________________________________________________________
(page generated 2024-08-31 23:01 UTC)