thtml.3 - plan9port - [fork] Plan 9 from user space
       ---
       thtml.3 (29268B)
       ---
            1 .TH HTML 3
            2 .SH NAME
            3 parsehtml,
            4 printitems,
            5 validitems,
            6 freeitems,
            7 freedocinfo,
            8 dimenkind,
            9 dimenspec,
           10 targetid,
           11 targetname,
           12 fromStr,
           13 toStr
           14 \- HTML parser
           15 .SH SYNOPSIS
           16 .nf
           17 .PP
           18 .ft L
           19 #include <u.h>
           20 #include <libc.h>
           21 #include <html.h>
           22 .ft P
           23 .PP
           24 .ta \w'\fLToken* 'u
           25 .B
           26 Item*        parsehtml(uchar* data, int datalen, Rune* src, int mtype,
           27 .B
           28         int chset, Docinfo** pdi)
           29 .PP
           30 .B
           31 void        printitems(Item* items, char* msg)
           32 .PP
           33 .B
           34 int        validitems(Item* items)
           35 .PP
           36 .B
           37 void        freeitems(Item* items)
           38 .PP
           39 .B
           40 void        freedocinfo(Docinfo* d)
           41 .PP
           42 .B
           43 int        dimenkind(Dimen d)
           44 .PP
           45 .B
           46 int        dimenspec(Dimen d)
           47 .PP
           48 .B
           49 int        targetid(Rune* s)
           50 .PP
           51 .B
           52 Rune*        targetname(int targid)
           53 .PP
           54 .B
           55 uchar*        fromStr(Rune* buf, int n, int chset)
           56 .PP
           57 .B
           58 Rune*        toStr(uchar* buf, int n, int chset)
           59 .SH DESCRIPTION
           60 .PP
           61 This library implements a parser for HTML 4.0 documents.
           62 The parsed HTML is converted into an intermediate representation that
           63 describes how the formatted HTML should be laid out.
           64 .PP
           65 .I Parsehtml
           66 parses an entire HTML document contained in the buffer
           67 .I data
           68 and having length
           69 .IR datalen .
           70 The URL of the document should be passed in as
           71 .IR src .
           72 .I Mtype
           73 is the media type of the document, which should be either
           74 .B TextHtml
           75 or
           76 .BR TextPlain .
           77 The character set of the document is described in
           78 .IR chset ,
           79 which can be one of
           80 .BR US_Ascii ,
           81 .BR ISO_8859_1 ,
           82 .B UTF_8
           83 or
           84 .BR Unicode .
           85 The return value is a linked list of
           86 .B Item
           87 structures, described in detail below.
           88 As a side effect, 
           89 .BI * pdi
           90 is set to point to a newly created
           91 .B Docinfo
           92 structure, containing information pertaining to the entire document.
           93 .PP
           94 The library expects two allocation routines to be provided by the
           95 caller,
           96 .B emalloc
           97 and
           98 .BR erealloc .
           99 These routines are analogous to the standard malloc and realloc routines,
          100 except that they should not return if the memory allocation fails.
          101 In addition,
          102 .B emalloc
          103 is required to zero the memory.
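A minimal sketch of the two caller-supplied allocators, written in portable C for illustration. The parameter types and the use of exit are assumptions (the text above gives no prototypes, so check the library's header for the exact ones); a Plan 9 program would more likely report the failure with sysfatal.

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocate n zeroed bytes. Never returns on failure.
   The unsigned long parameter type is an assumption. */
void*
emalloc(unsigned long n)
{
	void *p;

	if(n == 0)
		n = 1;	/* avoid a legitimate NULL from a zero-size request */
	p = calloc(1, n);
	if(p == NULL){
		fprintf(stderr, "emalloc: out of memory\n");
		exit(1);
	}
	return p;
}

/* Resize p to n bytes. Never returns on failure.
   Newly added bytes are not zeroed, as with realloc. */
void*
erealloc(void *p, unsigned long n)
{
	if(n == 0)
		n = 1;
	p = realloc(p, n);
	if(p == NULL){
		fprintf(stderr, "erealloc: out of memory\n");
		exit(1);
	}
	return p;
}
```

Only emalloc is required to zero its result, so erealloc leaves any newly added bytes uninitialized.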
          104 .PP
          105 For debugging purposes,
          106 .I printitems
          107 may be called to display the contents of an item list; individual items may
          108 be printed using the
          109 .B %I
          110 print verb, installed on the first call to
          111 .IR parsehtml .
          112 .I validitems
          113 traverses the item list, checking that all of the pointers are valid.
          114 It returns
          115 .B 1
          116 if everything is ok, and
          117 .B 0
          118 if an error was found.
          119 Normally, one would not call these routines directly.
          120 Instead, one sets the global variable
          121 .I dbgbuild
          122 and the library calls them automatically.
          123 One can also set
          124 .IR warn ,
          125 to cause the library to print a warning whenever it finds a problem with the
          126 input document, and
          127 .IR dbglex ,
          128 to print debugging information in the lexer.
          129 .PP
          130 When an item list is no longer needed, it should be freed with
          131 .IR freeitems .
          132 Then,
          133 .I freedocinfo
          134 should be called on the pointer returned in
          135 .BI * pdi\f1.
          136 .PP
          137 .I Dimenkind
          138 and
          139 .I dimenspec
          140 are provided to interpret the
          141 .B Dimen
          142 type, as described in the section
          143 .IR "Dimension Specifications" .
          144 .PP
          145 Frame target names are mapped to integer ids via a global, permanent mapping.
          146 To find the value for a given name, call
          147 .IR targetid ,
          148 which allocates a new id if the name hasn't been seen before.
          149 The name of a given, known id may be retrieved using
          150 .IR targetname .
          151 The library predefines
          152 .BR FTtop ,
          153 .BR FTself ,
          154 .B FTparent
          155 and
          156 .BR FTblank .
          157 .PP
          158 The library handles all text as Unicode strings (type
          159 .BR Rune* ).
          160 Character set conversion is provided by
          161 .I fromStr
          162 and
          163 .IR toStr .
          164 .I FromStr
          165 takes
          166 .I n
          167 Unicode characters from
          168 .I buf
          169 and converts them to the character set described by
          170 .IR chset .
          171 .I ToStr
          172 takes
          173 .I n
          174 bytes from
          175 .IR buf ,
          176 interpreted as belonging to character set
          177 .IR chset ,
          178 and converts them to a Unicode string.
          179 Both routines null-terminate the result, and use
          180 .B emalloc
          181 to allocate space for it.
          182 .SS Items
          183 The return value of
          184 .I parsehtml
          185 is a linked list of variant structures,
          186 with the generic portion described by the following definition:
          187 .PP
          188 .EX
          189 .ta 6n +\w'Genattr* 'u
          190 typedef struct Item Item;
          191 struct Item
          192 {
          193         Item*        next;
          194         int        width;
          195         int        height;
          196         int        ascent;
          197         int        anchorid;
          198         int        state;
          199         Genattr*        genattr;
          200         int        tag;
          201 };
          202 .EE
          203 .PP
          204 The field
          205 .B next
          206 points to the successor in the linked list of items, while
          207 .BR width ,
          208 .BR height ,
          209 and
          210 .B ascent
          211 are intended for use by the caller as part of the layout process.
          212 .BR Anchorid ,
          213 if non-zero, gives the integer id assigned by the parser to the anchor that
          214 this item is in (see section
          215 .IR Anchors ).
          216 .B State
          217 is a collection of flags and values described as follows:
          218 .PP
          219 .EX
          220 .ta 6n +\w'IFindentshift = 'u
          221 enum
          222 {
          223         IFbrk =        0x80000000,
          224         IFbrksp =        0x40000000,
          225         IFnobrk =        0x20000000,
          226         IFcleft =        0x10000000,
          227         IFcright =        0x08000000,
          228         IFwrap =        0x04000000,
          229         IFhang =        0x02000000,
          230         IFrjust =        0x01000000,
          231         IFcjust =        0x00800000,
          232         IFsmap =        0x00400000,
          233         IFindentshift =        8,
          234         IFindentmask =        (255<<IFindentshift),
          235         IFhangmask =        255
          236 };
          237 .EE
          238 .PP
          239 .B IFbrk
          240 is set if a break is to be forced before placing this item.
          241 .B IFbrksp
          242 is set if a one-line space should be added to the break (in which case
          243 .B IFbrk
          244 is also set).
          245 .B IFnobrk
          246 is set if a break is not permitted before the item.
          247 .B IFcleft
          248 is set if left floats should be cleared (that is, if the list of pending left floats should be placed)
          249 before this item is placed, and
          250 .B IFcright
          251 is set for right floats.
          252 In both cases, IFbrk is also set.
          253 .B IFwrap
          254 is set if the line containing this item is allowed to wrap.
          255 .B IFhang
          256 is set if this item hangs into the left indent.
          257 .B IFrjust
          258 is set if the line containing this item should be right justified,
          259 and
          260 .B IFcjust
          261 is set for center justified lines.
          262 .B IFsmap
          263 is used to indicate that an image is a server-side map.
          264 The low 8 bits, represented by
          265 .BR IFhangmask ,
          266 indicate the current hang into left indent, in tenths of a tabstop.
          267 The next 8 bits, represented by
          268 .B IFindentmask
          269 and
          270 .BR IFindentshift ,
          271 indicate the current indent in tab stops.
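The packing of the state word can be illustrated with a few accessors. This is a sketch, not library code: the helper names are hypothetical, and the flag values are copied from the enum above but written as unsigned macros so that the example compiles as portable C. A caller would pass the item's state field, converted to unsigned.

```c
/* Values copied from the enum above, as unsigned constants. */
#define IFbrk         0x80000000u
#define IFbrksp       0x40000000u
#define IFindentshift 8
#define IFindentmask  (255u << IFindentshift)
#define IFhangmask    255u

/* Hypothetical helper: nonzero if a break is forced before the item. */
int
forces_break(unsigned state)
{
	return (state & IFbrk) != 0;
}

/* Hypothetical helper: current indent, in tab stops (bits 8 to 15). */
unsigned
indent_tabstops(unsigned state)
{
	return (state & IFindentmask) >> IFindentshift;
}

/* Hypothetical helper: hang into the left indent,
   in tenths of a tab stop (low 8 bits). */
unsigned
hang_tenths(unsigned state)
{
	return state & IFhangmask;
}
```

For a state built as IFbrk | IFbrksp | (3 << IFindentshift) | 15, these return 1, 3 and 15 respectively.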
          272 .PP
          273 The field
          274 .B genattr
          275 is an optional pointer to an auxiliary structure, described in the section
          276 .IR "Generic Attributes" .
          277 .PP
          278 Finally,
          279 .B tag
          280 describes which variant type this item has.
          281 It can have one of the values
          282 .BR Itexttag ,
          283 .BR Iruletag ,
          284 .BR Iimagetag ,
          285 .BR Iformfieldtag ,
          286 .BR Itabletag ,
          287 .B Ifloattag
          288 or
          289 .BR Ispacertag .
          290 For each of these values, there is an additional structure defined, which
          291 includes Item as an unnamed initial substructure, and then defines additional
          292 fields.
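As a rough illustration of how a caller might walk the list and dispatch on tag, the sketch below counts the items of each variant. It uses a cut-down stand-in for Item (only next and tag) so that it compiles outside Plan 9; the numeric tag values and the name NumItemTags are assumptions made for the example.

```c
#include <stddef.h>

/* Stand-in tag values; only the names come from the text above. */
enum {
	Itexttag, Iruletag, Iimagetag, Iformfieldtag,
	Itabletag, Ifloattag, Ispacertag,
	NumItemTags	/* hypothetical: number of variants */
};

/* Cut-down stand-in for the generic part of an item. */
typedef struct Item Item;
struct Item {
	Item *next;
	int tag;
};

/* Count how many items of each variant the list holds.
   counts must have room for NumItemTags entries. */
void
count_by_tag(const Item *items, int counts[NumItemTags])
{
	const Item *it;
	int i;

	for(i = 0; i < NumItemTags; i++)
		counts[i] = 0;
	for(it = items; it != NULL; it = it->next)
		if(it->tag >= 0 && it->tag < NumItemTags)
			counts[it->tag]++;
}
```

In a real program the loop body would typically be a switch on the tag that casts the item pointer to the matching variant type (for example to Itext* when the tag is Itexttag) before reading the variant's fields.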
          293 .PP
          294 Items of type
          295 .B Itexttag
          296 represent a piece of text, using the following structure:
          297 .PP
          298 .EX
          299 .ta 6n +\w'Rune* 'u
          300 struct Itext
          301 {
          302         Item;
          303         Rune*        s;
          304         int        fnt;
          305         int        fg;
          306         uchar        voff;
          307         uchar        ul;
          308 };
          309 .EE
          310 .PP
          311 Here
          312 .B s
          313 is a null-terminated Unicode string of the actual characters making up this text item,
          314 .B fnt
          315 is the font number (described in the section
          316 .IR "Font Numbers" ),
          317 and
          318 .B fg
          319 is the RGB encoded color for the text.
          320 .B Voff
          321 measures the vertical offset from the baseline; subtract
          322 .B Voffbias
          323 to get the actual value (negative values represent a displacement down the page).
          324 The field
          325 .B ul
          326 is the underline style:
          327 .B ULnone
          328 if no underline,
          329 .B ULunder
          330 for conventional underline, and
          331 .B ULmid
          332 for strike-through.
          333 .PP
          334 Items of type
          335 .B Iruletag
          336 represent a horizontal rule, as follows:
          337 .PP
          338 .EX
          339 .ta 6n +\w'Dimen 'u
          340 struct Irule
          341 {
          342         Item;
          343         uchar        align;
          344         uchar        noshade;
          345         int        size;
          346         Dimen        wspec;
          347 };
          348 .EE
          349 .PP
          350 Here
          351 .B align
          352 is the alignment specification (see section \fIAlignment Specifications\fP),
          353 .B noshade
          354 is set if the rule should not be shaded,
          355 .B size
          356 is the height of the rule (as set by the size attribute),
          357 and
          358 .B wspec
          359 is the desired width (see section
          360 .IR "Dimension Specifications" ).
          361 .PP
          362 Items of type
          363 .B Iimagetag
          364 describe embedded images, for which the following structure is defined:
          365 .PP
          366 .EX
          367 .ta 6n +\w'Iimage* 'u
          368 struct Iimage
          369 {
          370         Item;
          371         Rune*        imsrc;
          372         int        imwidth;
          373         int        imheight;
          374         Rune*        altrep;
          375         Map*        map;
          376         int        ctlid;
          377         uchar        align;
          378         uchar        hspace;
          379         uchar        vspace;
          380         uchar        border;
          381         Iimage*        nextimage;
          382 };
          383 .EE
          384 .PP
          385 Here
          386 .B imsrc
          387 is the URL of the image source,
          388 .B imwidth
          389 and
          390 .BR imheight ,
          391 if non-zero, contain the specified width and height for the image,
          392 and
          393 .B altrep
          394 is the text to use as an alternative to the image, if the image is not displayed.
          395 .BR Map ,
          396 if set, points to a structure describing an associated client-side image map.
          397 .B Ctlid
          398 is reserved for use by the application, for handling animated images.
          399 .B Align
          400 encodes the alignment specification of the image.
          401 .B Hspace
          402 contains the number of pixels to pad the image with on either side, and
          403 .B Vspace
          404 the padding above and below.
          405 .B Border
          406 is the width of the border to draw around the image.
          407 .B Nextimage
          408 points to the next image in the document (the head of this list is
          409 .BR Docinfo.images ).
          410 .PP
          411 For items of type
          412 .BR Iformfieldtag ,
          413 the following structure is defined:
          414 .PP
          415 .EX
          416 .ta 6n +\w'Formfield* 'u
          417 struct Iformfield
          418 {
          419         Item;
          420         Formfield*        formfield;
          421 };
          422 .EE
          423 .PP
          424 This adds a single field,
          425 .BR formfield ,
          426 which points to a structure describing a field in a form, described in section
          427 .IR Forms .
          428 .PP
          429 For items of type
          430 .BR Itabletag ,
          431 the following structure is defined:
          432 .PP
          433 .EX
          434 .ta 6n +\w'Table* 'u
          435 struct Itable
          436 {
          437         Item;
          438         Table*        table;
          439 };
          440 .EE
          441 .PP
          442 .B Table
          443 points to a structure describing the table, described in the section
          444 .IR Tables .
          445 .PP
          446 For items of type
          447 .BR Ifloattag ,
          448 the following structure is defined:
          449 .PP
          450 .EX
          451 .ta 6n +\w'Ifloat* 'u
          452 struct Ifloat
          453 {
          454         Item;
          455         Item*        item;
          456         int        x;
          457         int        y;
          458         uchar        side;
          459         uchar        infloats;
          460         Ifloat*        nextfloat;
          461 };
          462 .EE
          463 .PP
          464 The field
          465 .B item
          466 points to a single item (either a table or an image) that floats (the text of the
          467 document flows around it), and
          468 .B side
          469 indicates the margin that this float sticks to; it is either
          470 .B ALleft
          471 or
          472 .BR ALright .
          473 .B X
          474 and
          475 .B y
          476 are reserved for use by the caller; these are typically used for the coordinates
          477 of the top of the float.
          478 .B Infloats
          479 is used by the caller to keep track of whether it has placed the float.
          480 .B Nextfloat
          481 is used by the caller to link together all of the floats that it has placed.
          482 .PP
          483 For items of type
          484 .BR Ispacertag ,
          485 the following structure is defined:
          486 .PP
          487 .EX
          488 .ta 6n +\w'Item; 'u
          489 struct Ispacer
          490 {
          491         Item;
          492         int        spkind;
          493 };
          494 .EE
          495 .PP
          496 .B Spkind
          497 encodes the kind of spacer, and may be one of
          498 .B ISPnull
          499 (zero height and width),
          500 .B ISPvline
          501 (takes on height and ascent of the current font),
          502 .B ISPhspace
          503 (has the width of a space in the current font) and
          504 .B ISPgeneral
          505 (for all other purposes, such as between markers and lists).
          506 .SS Generic Attributes
          507 .PP
          508 The genattr field of an item, if non-nil, points to a structure that holds
          509 the values of attributes not specific to any particular
          510 item type, as they occur on a wide variety of underlying HTML tags.
          511 The structure is as follows:
          512 .PP
          513 .EX
          514 .ta 6n +\w'SEvent* 'u
          515 typedef struct Genattr Genattr;
          516 struct Genattr
          517 {
          518         Rune*        id;
          519         Rune*        class;
          520         Rune*        style;
          521         Rune*        title;
          522         SEvent*        events;
          523 };
          524 .EE
          525 .PP
          526 Fields
          527 .BR id ,
          528 .BR class ,
          529 .B style
          530 and
          531 .BR title ,
          532 when non-nil, contain values of correspondingly named attributes of the HTML tag
          533 associated with this item.
          534 .B Events
          535 is a linked list of events (with corresponding scripted actions) associated with the item:
          536 .PP
          537 .EX
          538 .ta 6n +\w'SEvent* 'u
          539 typedef struct SEvent SEvent;
          540 struct SEvent
          541 {
          542         SEvent*        next;
          543         int        type;
          544         Rune*        script;
          545 };
          546 .EE
          547 .PP
          548 Here,
          549 .B next
          550 points to the next event in the list,
          551 .B type
          552 is one of
          553 .BR SEonblur ,
          554 .BR SEonchange ,
          555 .BR SEonclick ,
          556 .BR SEondblclick ,
          557 .BR SEonfocus ,
          558 .BR SEonkeypress ,
          559 .BR SEonkeyup ,
          560 .BR SEonload ,
          561 .BR SEonmousedown ,
          562 .BR SEonmousemove ,
          563 .BR SEonmouseout ,
          564 .BR SEonmouseover ,
          565 .BR SEonmouseup ,
          566 .BR SEonreset ,
          567 .BR SEonselect ,
          568 .B SEonsubmit
          569 or
          570 .BR SEonunload ,
          571 and
          572 .B script
          573 is the text of the associated script.
          574 .SS Dimension Specifications
          575 .PP
          576 Some structures include a dimension specification, used where
          577 a number can be followed by a
          578 .B %
          579 or a
          580 .B *
          581 to indicate
          582 percentage of total or relative weight.
          583 This is encoded using the following structure:
          584 .PP
          585 .EX
          586 .ta 6n +\w'int 'u
          587 typedef struct Dimen Dimen;
          588 struct Dimen
          589 {
          590         int        kindspec;
          591 };
          592 .EE
          593 .PP
          594 Separate kind and spec values are extracted using
          595 .I dimenkind
          596 and
          597 .IR dimenspec .
          598 .I Dimenkind
          599 returns one of
          600 .BR Dnone ,
          601 .BR Dpixels ,
          602 .B Dpercent
          603 or
          604 .BR Drelative .
          605 .B Dnone
          606 means that no dimension was specified.
          607 In all other cases,
          608 .I dimenspec
          609 should be called to find the absolute number of pixels, the percentage of total,
          610 or the relative weight.
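One way a layout engine might turn a kind and spec pair into a pixel width is sketched below. The function name, the fallback policy for Dnone and Drelative, and the numeric enum values are all assumptions made for illustration; in a real caller the first two arguments would come from dimenkind(d) and dimenspec(d).

```c
/* Placeholder values; only the names come from the text above. */
enum { Dnone, Dpixels, Dpercent, Drelative };

/* Hypothetical helper: resolve a dimension against the available
   width avail. Dnone and Drelative fall back to the given default,
   since relative weights need the sibling weights to resolve. */
int
resolve_width(int kind, int spec, int avail, int fallback)
{
	switch(kind){
	case Dpixels:
		return spec;
	case Dpercent:
		return avail * spec / 100;
	default:
		return fallback;
	}
}
```

With 640 pixels available, a 50 percent specification resolves to 320 and a 120 pixel specification stays at 120.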
          611 .SS Background Specifications
          612 .PP
          613 It is possible to set the background of the entire document, and also
          614 for some parts of the document (such as tables).
          615 This is encoded as follows:
          616 .PP
          617 .EX
          618 .ta 6n +\w'Rune* 'u
          619 typedef struct Background Background;
          620 struct Background
          621 {
          622         Rune*        image;
          623         int        color;
          624 };
          625 .EE
          626 .PP
          627 .BR Image ,
          628 if non-nil, is the URL of an image to use as the background.
          629 If this is nil,
          630 .B color
          631 is used instead, as the RGB value for a solid fill color.
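A small sketch of splitting such an RGB value into its components. The byte order (red in bits 16 to 23, then green, then blue, as in 0xRRGGBB) is an assumption, as are the type and function names; check it against the library's color table before relying on it.

```c
/* Hypothetical component triple, not a library type. */
typedef struct Rgb Rgb;
struct Rgb {
	unsigned char r;
	unsigned char g;
	unsigned char b;
};

/* Unpack a 24-bit color, assuming 0xRRGGBB packing. */
Rgb
unpack_rgb(int color)
{
	Rgb c;

	c.r = (color >> 16) & 0xFF;
	c.g = (color >> 8) & 0xFF;
	c.b = color & 0xFF;
	return c;
}
```

Under that assumption, 0xFF8000 unpacks to red 255, green 128, blue 0.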
          632 .SS Alignment Specifications
          633 .PP
          634 Certain items have alignment specifiers taken from the following
          635 enumerated type:
          636 .PP
          637 .EX
          638 .ta 6n
          639 enum
          640 {
          641         ALnone = 0, ALleft, ALcenter, ALright, ALjustify,
          642         ALchar, ALtop, ALmiddle, ALbottom, ALbaseline
          643 };
          644 .EE
          645 .PP
          646 These values correspond to the various alignment types named in the HTML 4.0
          647 standard.
          648 If an item has an alignment of
          649 .B ALleft
          650 or
          651 .BR ALright ,
          652 the library automatically encapsulates it inside a float item.
          653 .PP
          654 Tables, and the various rows, columns and cells within them, have a more
          655 complex alignment specification, composed of separate vertical and
          656 horizontal alignments:
          657 .PP
          658 .EX
          659 .ta 6n +\w'uchar 'u
          660 typedef struct Align Align;
          661 struct Align
          662 {
          663         uchar        halign;
          664         uchar        valign;
          665 };
          666 .EE
          667 .PP
          668 .B Halign
          669 can be one of
          670 .BR ALnone ,
          671 .BR ALleft ,
          672 .BR ALcenter ,
          673 .BR ALright ,
          674 .B ALjustify
          675 or
          676 .BR ALchar .
          677 .B Valign
          678 can be one of
          679 .BR ALnone ,
          680 .BR ALmiddle ,
          681 .BR ALbottom ,
          682 .BR ALtop
          683 or
          684 .BR ALbaseline .
          685 .SS Font Numbers
          686 .PP
          687 Text items have an associated font number (the
          688 .B fnt
          689 field), which is encoded as
          690 .BR style*NumSize+size .
          691 Here,
          692 .B style
          693 is one of
          694 .BR FntR ,
          695 .BR FntI ,
          696 .B FntB
          697 or
          698 .BR FntT ,
          699 for roman, italic, bold and typewriter font styles, respectively, and size is one of
          700 .BR Tiny ,
          701 .BR Small ,
          702 .BR Normal ,
          703 .B Large
          704 or
          705 .BR Verylarge .
          706 The total number of possible font numbers is
          707 .BR NumFnt ,
          708 and the default font number is
          709 .B DefFnt
          710 (which is roman style, normal size).
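The encoding style*NumSize+size can be composed and taken apart as sketched below. The numeric values of the constants and the name NumStyle are assumptions (the text gives only the names and their order), and the three helper functions are hypothetical.

```c
/* Assumed values: styles and sizes numbered from zero in the
   order listed above. NumStyle is an assumed name. */
enum { FntR, FntI, FntB, FntT, NumStyle };
enum { Tiny, Small, Normal, Large, Verylarge, NumSize };
enum {
	NumFnt = NumStyle * NumSize,
	DefFnt = FntR * NumSize + Normal
};

/* Hypothetical helper: build a font number from style and size. */
int
font_number(int style, int size)
{
	return style * NumSize + size;
}

/* Hypothetical helper: recover the style from a font number. */
int
font_style(int fnt)
{
	return fnt / NumSize;
}

/* Hypothetical helper: recover the size from a font number. */
int
font_size(int fnt)
{
	return fnt % NumSize;
}
```

Under these assumed values, bold at large size is font number 2*5+3 = 13, and there are 20 font numbers in total.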
          711 .SS Document Info
          712 .PP
          713 Global information about an HTML page is stored in the following structure:
          714 .PP
          715 .EX
          716 .ta 6n +\w'DestAnchor* 'u
          717 typedef struct Docinfo Docinfo;
          718 struct Docinfo
          719 {
          720         // stuff from HTTP headers, doc head, and body tag
          721         Rune*        src;
          722         Rune*        base;
          723         Rune*        doctitle;
          724         Background        background;
          725         Iimage*        backgrounditem;
          726         int        text;
          727         int        link;
          728         int        vlink;
          729         int        alink;
          730         int        target;
          731         int        chset;
          732         int        mediatype;
          733         int        scripttype;
          734         int        hasscripts;
          735         Rune*        refresh;
          736         Kidinfo*        kidinfo;
          737         int        frameid;
          738 
          739         // info needed to respond to user actions
          740         Anchor*        anchors;
          741         DestAnchor*        dests;
          742         Form*        forms;
          743         Table*        tables;
          744         Map*        maps;
          745         Iimage*        images;
          746 };
          747 .EE
          748 .PP
          749 .B Src
          750 gives the URL of the original source of the document,
          751 and
          752 .B base
          753 is the base URL.
          754 .B Doctitle
          755 is the document's title, as set by a
          756 .B <title>
          757 element.
          758 .B Background
          759 is as described in the section
          760 .IR "Background Specifications" ,
          761 and
          762 .B backgrounditem
          763 is set to be an image item for the document's background image (if given as a URL),
          764 or else nil.
          765 .B Text
          766 gives the default foreground text color of the document,
          767 .B link
          768 the unvisited hyperlink color,
          769 .B vlink
          770 the visited hyperlink color, and
          771 .B alink
          772 the color for highlighting hyperlinks (all in 24-bit RGB format).
          773 .B Target
          774 is the default target frame id.
          775 .B Chset
          776 and
          777 .B mediatype
          778 are as for the
          779 .I chset
          780 and
          781 .I mtype
          782 parameters to
          783 .IR parsehtml .
          784 .B Scripttype
          785 is the type of any scripts contained in the document, and is always
          786 .BR TextJavascript .
          787 .B Hasscripts
          788 is set if the document contains any scripts.
          789 Scripting is currently unsupported.
          790 .B Refresh
          791 is the contents of a
          792 .B "<meta http-equiv=Refresh ...>"
          793 tag, if any.
          794 .B Kidinfo
          795 is set if this document is a frameset (see section
          796 .IR Frames ).
          797 .B Frameid
          798 is this document's frame id.
          799 .PP
          800 .B Anchors
          801 is a list of hyperlinks contained in the document,
          802 and
          803 .B dests
          804 is a list of hyperlink destinations within the page (see the following section for details).
          805 .BR Forms ,
          806 .B tables
          807 and
          808 .B maps
          809 are lists of the various forms, tables and client-side maps contained
          810 in the document, as described in subsequent sections.
          811 .B Images
          812 is a list of all the image items in the document.
          813 .SS Anchors
          814 .PP
          815 The library builds two lists for all of the
          816 .B <a>
          817 elements (anchors) in a document.
          818 Each anchor is assigned a unique anchor id within the document.
          819 For anchors which are hyperlinks (the
          820 .B href
          821 attribute was supplied), the following structure is defined:
          822 .PP
          823 .EX
          824 .ta 6n +\w'Anchor* 'u
          825 typedef struct Anchor Anchor;
          826 struct Anchor
          827 {
          828         Anchor*        next;
          829         int        index;
          830         Rune*        name;
          831         Rune*        href;
          832         int        target;
          833 };
          834 .EE
          835 .PP
          836 .B Next
          837 points to the next anchor in the list (the head of this list is
          838 .BR Docinfo.anchors ).
          839 .B Index
          840 is the anchor id; each item within this hyperlink is tagged with this value
          841 in its
          842 .B anchorid
          843 field.
          844 .B Name
          845 and
          846 .B href
          847 are the values of the correspondingly named attributes of the anchor
          848 (in particular, href is the URL to go to).
          849 .B Target
          850 is the value of the target attribute (if provided) converted to a frame id.
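Given an item's anchorid, the enclosing hyperlink can be found by searching the anchor list for a matching index. The sketch below shows one way to do this; find_anchor is a hypothetical helper, and the Rune typedef is a stand-in so that the example compiles outside Plan 9.

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the Plan 9 type */

typedef struct Anchor Anchor;
struct Anchor {
	Anchor *next;
	int index;
	Rune *name;
	Rune *href;
	int target;
};

/* Hypothetical helper: return the anchor whose index equals
   anchorid, or NULL if there is none. An anchorid of zero
   means the item is not inside any anchor. */
Anchor*
find_anchor(Anchor *anchors, int anchorid)
{
	Anchor *a;

	if(anchorid == 0)
		return NULL;
	for(a = anchors; a != NULL; a = a->next)
		if(a->index == anchorid)
			return a;
	return NULL;
}
```

A caller would pass the head of the anchor list from the document info together with the anchorid of the item the user selected.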
          851 .PP
          852 Destinations within the document (anchors with the name attribute set)
          853 are held in the
          854 .B Docinfo.dests
          855 list, using the following structure:
          856 .PP
          857 .EX
          858 .ta 6n +\w'DestAnchor* 'u
          859 typedef struct DestAnchor DestAnchor;
          860 struct DestAnchor
          861 {
          862         DestAnchor*        next;
          863         int        index;
          864         Rune*        name;
          865         Item*        item;
          866 };
          867 .EE
          868 .PP
          869 .B Next
          870 is the next element of the list,
          871 .B index
          872 is the anchor id,
          873 .B name
          874 is the value of the name attribute, and
          875 .B item
          876 points to the item within the parsed document that should be considered
          877 to be the destination.
          878 .SS Forms
          879 .PP
          880 Any forms within a document are kept in a list, headed by
          881 .BR Docinfo.forms .
          882 The elements of this list are as follows:
          883 .PP
          884 .EX
          885 .ta 6n +\w'Formfield* 'u
          886 typedef struct Form Form;
          887 struct Form
          888 {
          889         Form*        next;
          890         int        formid;
          891         Rune*        name;
          892         Rune*        action;
          893         int        target;
          894         int        method;
          895         int        nfields;
          896         Formfield*        fields;
          897 };
          898 .EE
          899 .PP
          900 .B Next
          901 points to the next form in the list.
          902 .B Formid
          903 is a serial number for the form within the document.
          904 .B Name
          905 is the value of the form's name or id attribute.
          906 .B Action
          907 is the value of any action attribute.
          908 .B Target
          909 is the value of the target attribute (if any) converted to a frame target id.
          910 .B Method
          911 is one of
          912 .B HGet
          913 or
          914 .BR HPost .
          915 .B Nfields
          916 is the number of fields in the form, and
          917 .B fields
          918 is a linked list of the actual fields.
          919 .PP
          920 The individual fields in a form are described by the following structure:
          921 .PP
          922 .EX
          923 .ta 6n +\w'Formfield* 'u
          924 typedef struct Formfield Formfield;
          925 struct Formfield
          926 {
          927         Formfield*        next;
          928         int        ftype;
          929         int        fieldid;
          930         Form*        form;
          931         Rune*        name;
          932         Rune*        value;
          933         int        size;
          934         int        maxlength;
          935         int        rows;
          936         int        cols;
          937         uchar        flags;
          938         Option*        options;
          939         Item*        image;
          940         int        ctlid;
          941         SEvent*        events;
          942 };
          943 .EE
          944 .PP
          945 Here,
          946 .B next
          947 points to the next field in the list.
          948 .B Ftype
          949 is the type of the field, which can be one of
          950 .BR Ftext ,
          951 .BR Fpassword ,
          952 .BR Fcheckbox ,
          953 .BR Fradio ,
          954 .BR Fsubmit ,
          955 .BR Fhidden ,
          956 .BR Fimage ,
          957 .BR Freset ,
          958 .BR Ffile ,
          959 .BR Fbutton ,
          960 .B Fselect
          961 or
          962 .BR Ftextarea .
          963 .B Fieldid
          964 is a serial number for the field within the form.
          965 .B Form
          966 points back to the form containing this field.
          967 .BR Name ,
          968 .BR value ,
          969 .BR size ,
          970 .BR maxlength ,
          971 .B rows
          972 and
          973 .B cols
          974 each contain the values of corresponding attributes of the field, if present.
          975 .B Flags
          976 contains per-field flags, of which
          977 .B FFchecked
          978 and
          979 .B FFmultiple
          980 are defined.
          981 .B Image
          982 is only used for fields of type
          983 .BR Fimage ;
          984 it points to an image item containing the image to be displayed.
          985 .B Ctlid
          986 is reserved for use by the caller, typically to store a unique id
          987 of an associated control used to implement the field.
          988 .B Events
          989 is the same as the corresponding field of the generic attributes
          990 associated with the item containing this field.
          991 .B Options
          992 is only used by fields of type
          993 .BR Fselect ;
          994 it consists of a list of possible options that may be selected for that
          995 field, using the following structure:
          996 .PP
          997 .EX
          998 .ta 6n +\w'Option* 'u
          999 typedef struct Option Option;
         1000 struct Option
         1001 {
         1002         Option*        next;
         1003         int        selected;
         1004         Rune*        value;
         1005         Rune*        display;
         1006 };
         1007 .EE
         1008 .PP
         1009 .B Next
         1010 points to the next element of the list.
         1011 .B Selected
         1012 is set if this option is to be displayed initially.
         1013 .B Value
         1014 is the value to send when the form is submitted if this option is selected.
         1015 .B Display
         1016 is the string to display on the screen for this option.
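The option list can be walked like any other list in this library. The sketch below, which is only an illustration, finds the first option marked as selected and counts how many are marked. The names firstselected and nselected are hypothetical, and the typedefs are stand-ins for the declarations in <html.h>.

```c
#include <stddef.h>

/* Stand-in declarations (assumptions); the real ones come from <html.h>. */
typedef unsigned int Rune;
typedef struct Option Option;
struct Option
{
	Option*	next;
	int	selected;
	Rune*	value;
	Rune*	display;
};

/* Hypothetical helper: first option with selected set, or NULL if none. */
Option*
firstselected(Option *options)
{
	Option *o;

	for(o = options; o != NULL; o = o->next)
		if(o->selected)
			return o;
	return NULL;
}

/* Hypothetical helper: number of options with selected set. */
int
nselected(Option *options)
{
	Option *o;
	int n;

	n = 0;
	for(o = options; o != NULL; o = o->next)
		if(o->selected)
			n++;
	return n;
}
```

Both helpers would be called with Formfield.options for a field whose ftype is Fselect.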
         1017 .SS Tables
         1018 .PP
         1019 The library builds a list of all the tables in the document,
         1020 headed by
         1021 .BR Docinfo.tables .
         1022 Each element of this list has the following format:
         1023 .PP
         1024 .EX
         1025 .ta 6n +\w'Tablecell*** 'u
         1026 typedef struct Table Table;
         1027 struct Table
         1028 {
         1029         Table*        next;
         1030         int        tableid;
         1031         Tablerow*        rows;
         1032         int        nrow;
         1033         Tablecol*        cols;
         1034         int        ncol;
         1035         Tablecell*        cells;
         1036         int        ncell;
         1037         Tablecell***        grid;
         1038         Align        align;
         1039         Dimen        width;
         1040         int        border;
         1041         int        cellspacing;
         1042         int        cellpadding;
         1043         Background        background;
         1044         Item*        caption;
         1045         uchar        caption_place;
         1046         Lay*        caption_lay;
         1047         int        totw;
         1048         int        toth;
         1049         int        caph;
         1050         int        availw;
         1051         Token*        tabletok;
         1052         uchar        flags;
         1053 };
         1054 .EE
         1055 .PP
         1056 .B Next
         1057 points to the next element in the list of tables.
         1058 .B Tableid
         1059 is a serial number for the table within the document.
         1060 .B Rows
         1061 is an array of row specifications (described below) and
         1062 .B nrow
         1063 is the number of elements in this array.
         1064 Similarly,
         1065 .B cols
         1066 is an array of column specifications, and
         1067 .B ncol
         1068 the size of this array.
         1069 .B Cells
         1070 is a list of all cells within the table (structure described below)
         1071 and
         1072 .B ncell
         1073 is the number of elements in this list.
         1074 Note that a cell may span multiple rows and/or columns, thus
         1075 .B ncell
         1076 may be smaller than
         1077 .BR nrow*ncol .
         1078 .B Grid
         1079 is a two-dimensional array of cells within the table; the cell
         1080 at row
         1081 .B i
         1082 and column
         1083 .B j
         1084 is
         1085 .BR Table.grid[i][j] .
         1086 A cell that spans multiple rows and/or columns will
         1087 be referenced by
         1088 .B grid
         1089 multiple times, but it occurs only once in
         1090 .BR cells .
         1091 .B Align
         1092 gives the alignment specification for the entire table,
         1093 and
         1094 .B width
         1095 gives the requested width as a dimension specification.
         1096 .BR Border ,
         1097 .B cellspacing
         1098 and
         1099 .B cellpadding
         1100 give the values of the corresponding attributes for the table,
         1101 and
         1102 .B background
         1103 gives the requested background for the table.
         1104 .B Caption
         1105 is a linked list of items to be displayed as the caption of the
         1106 table, either above or below depending on whether
         1107 .B caption_place
         1108 is
         1109 .B ALtop
         1110 or
         1111 .BR ALbottom .
         1112 Most of the remaining fields are reserved for use by the caller,
         1113 except
         1114 .BR tabletok ,
         1115 which is reserved for internal use.
         1116 The type
         1117 .B Lay
         1118 is not defined by the library; the caller can provide its
         1119 own definition.
         1120 .PP
         1121 The
         1122 .B Tablecol
         1123 structure is defined for use by the caller.
         1124 The library ensures that the correct number of these
         1125 is allocated, but leaves them blank.
         1126 The fields are as follows:
         1127 .PP
         1128 .EX
         1129 .ta 6n +\w'Point 'u
         1130 typedef struct Tablecol Tablecol;
         1131 struct Tablecol
         1132 {
         1133         int        width;
         1134         Align        align;
         1135         Point        pos;
         1136 };
         1137 .EE
         1138 .PP
         1139 The rows in the table are specified as follows:
         1140 .PP
         1141 .EX
         1142 .ta 6n +\w'Background 'u
         1143 typedef struct Tablerow Tablerow;
         1144 struct Tablerow
         1145 {
         1146         Tablerow*        next;
         1147         Tablecell*        cells;
         1148         int        height;
         1149         int        ascent;
         1150         Align        align;
         1151         Background        background;
         1152         Point        pos;
         1153         uchar        flags;
         1154 };
         1155 .EE
         1156 .PP
         1157 .B Next
         1158 is only used during parsing; it should be ignored by the caller.
         1159 .B Cells
         1160 provides a list of all the cells in a row, linked through their
         1161 .B nextinrow
         1162 fields (see below).
         1163 .BR Height ,
         1164 .B ascent
         1165 and
         1166 .B pos
         1167 are reserved for use by the caller.
         1168 .B Align
         1169 is the alignment specification for the row, and
         1170 .B background
         1171 is the background to use, if specified.
         1172 .B Flags
         1173 is used by the parser; ignore this field.
         1174 .PP
         1175 The individual cells of the table are described as follows:
         1176 .PP
         1177 .EX
         1178 .ta 6n +\w'Background 'u
         1179 typedef struct Tablecell Tablecell;
         1180 struct Tablecell
         1181 {
         1182         Tablecell*        next;
         1183         Tablecell*        nextinrow;
         1184         int        cellid;
         1185         Item*        content;
         1186         Lay*        lay;
         1187         int        rowspan;
         1188         int        colspan;
         1189         Align        align;
         1190         uchar        flags;
         1191         Dimen        wspec;
         1192         int        hspec;
         1193         Background        background;
         1194         int        minw;
         1195         int        maxw;
         1196         int        ascent;
         1197         int        row;
         1198         int        col;
         1199         Point        pos;
         1200 };
         1201 .EE
         1202 .PP
         1203 .B Next
         1204 is used to link together the list of all cells within a table
         1205 .RB ( Table.cells ),
         1206 whereas
         1207 .B nextinrow
         1208 is used to link together all the cells within a single row
         1209 .RB ( Tablerow.cells ).
         1210 .B Cellid
         1211 provides a serial number for the cell within the table.
         1212 .B Content
         1213 is a linked list of the items to be laid out within the cell.
         1214 .B Lay
         1215 is reserved for the user to describe how these items have
         1216 been laid out.
         1217 .B Rowspan
         1218 and
         1219 .B colspan
         1220 are the number of rows and columns spanned by this cell,
         1221 respectively.
         1222 .B Align
         1223 is the alignment specification for the cell.
         1224 .B Flags
         1225 is some combination of
         1226 .BR TFparsing ,
         1227 .B TFnowrap
         1228 and
         1229 .B TFisth
         1230 or'd together.
         1231 Here
         1232 .B TFparsing
         1233 is used internally by the parser, and should be ignored.
         1234 .B TFnowrap
         1235 means that the contents of the cell should not be
         1236 wrapped if they don't fit the available width;
         1237 rather, the table should be expanded if need be
         1238 (this is set when the nowrap attribute is supplied).
         1239 .B TFisth
         1240 means that the cell was created by the
         1241 .B <th>
         1242 element (rather than the
         1243 .B <td>
         1244 element),
         1245 indicating that it is a header cell rather than a data cell.
         1246 .B Wspec
         1247 provides a suggested width as a dimension specification,
         1248 and
         1249 .B hspec
         1250 provides a suggested height in pixels.
         1251 .B Background
         1252 gives a background specification for the individual cell.
         1253 .BR Minw ,
         1254 .BR maxw ,
         1255 .B ascent
         1256 and
         1257 .B pos
         1258 are reserved for use by the caller during layout.
         1259 .B Row
         1260 and
         1261 .B col
         1262 give the indices of the row and column of the top left-hand
         1263 corner of the cell within the table grid.
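Because a spanning cell appears in several grid slots, code that scans the grid usually needs to recognize the slot at which a cell starts. A minimal sketch, assuming only the row and col semantics described above: a slot is treated as a cell's origin when its indices equal the cell's row and col. The name listcells is hypothetical, the structures are cut-down stand-ins for those in <html.h>, and the check for an empty slot is a defensive assumption rather than documented behavior.

```c
#include <stddef.h>

/* Cut-down stand-ins (assumptions) holding only the fields used here. */
typedef struct Tablecell Tablecell;
struct Tablecell
{
	int	cellid;
	int	rowspan;
	int	colspan;
	int	row;
	int	col;
};

typedef struct Table Table;
struct Table
{
	int	nrow;
	int	ncol;
	Tablecell***	grid;
};

/*
 * Hypothetical helper: store each distinct cell once, ordered by the
 * position of its top left-hand corner, in out[0..max-1].
 * Returns the number of distinct cells found, which may exceed max.
 */
int
listcells(Table *t, Tablecell **out, int max)
{
	int i, j, n;
	Tablecell *c;

	n = 0;
	for(i = 0; i < t->nrow; i++)
		for(j = 0; j < t->ncol; j++){
			c = t->grid[i][j];
			/* skip empty slots and repeats of a spanning cell */
			if(c == NULL || c->row != i || c->col != j)
				continue;
			if(n < max)
				out[n] = c;
			n++;
		}
	return n;
}
```

The same cells can be reached without the grid by following the next links from Table.cells; the grid scan is useful when the visiting order must follow the layout.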
         1264 .SS Client-side Maps
         1265 .PP
         1266 The library builds a list of client-side maps, headed by
         1267 .BR Docinfo.maps ,
         1268 and having the following structure:
         1269 .PP
         1270 .EX
         1271 .ta 6n +\w'Rune* 'u
         1272 typedef struct Map Map;
         1273 struct Map
         1274 {
         1275         Map*        next;
         1276         Rune*        name;
         1277         Area*        areas;
         1278 };
         1279 .EE
         1280 .PP
         1281 .B Next
         1282 points to the next element in the list,
         1283 .B name
         1284 is the name of the map (used to bind it to an image), and
         1285 .B areas
         1286 is a list of the areas within the image that comprise the map,
         1287 using the following structure:
         1288 .PP
         1289 .EX
         1290 .ta 6n +\w'Dimen* 'u
         1291 typedef struct Area Area;
         1292 struct Area
         1293 {
         1294         Area*        next;
         1295         int        shape;
         1296         Rune*        href;
         1297         int        target;
         1298         Dimen*        coords;
         1299         int        ncoords;
         1300 };
         1301 .EE
         1302 .PP
         1303 .B Next
         1304 points to the next element in the map's list of areas.
         1305 .B Shape
         1306 describes the shape of the area, and is one of
         1307 .BR SHrect ,
         1308 .B SHcircle
         1309 or
         1310 .BR SHpoly .
         1311 .B Href
         1312 is the URL associated with this area in its role as
         1313 a hypertext link, and
         1314 .B target
         1315 is the target frame it should be loaded in.
         1316 .B Coords
         1317 is an array of coordinates for the shape, and
         1318 .B ncoords
         1319 is the size of this array (number of elements).
         1320 .SS Frames
         1321 .PP
         1322 If the
         1323 .B Docinfo.kidinfo
         1324 field is set, the document is a frameset.
         1325 In this case, it is typical for
         1326 .I parsehtml
         1327 to return nil, as a document which is a frameset should have no actual
         1328 items that need to be laid out (such will appear only in subsidiary documents).
         1329 It is possible that items will be returned for a malformed document; the caller
         1330 should check for this and free any such items.
         1331 .PP
         1332 The
         1333 .B Kidinfo
         1334 structure itself reflects the fact that framesets can be nested within a document.
         1335 It is defined as follows:
         1336 .PP
         1337 .EX
         1338 .ta 6n +\w'Kidinfo* 'u
         1339 typedef struct Kidinfo Kidinfo;
         1340 struct Kidinfo
         1341 {
         1342         Kidinfo*        next;
         1343         int        isframeset;
         1344 
         1345         // fields for "frame"
         1346         Rune*        src;
         1347         Rune*        name;
         1348         int        marginw;
         1349         int        marginh;
         1350         int        framebd;
         1351         int        flags;
         1352 
         1353         // fields for "frameset"
         1354         Dimen*        rows;
         1355         int        nrows;
         1356         Dimen*        cols;
         1357         int        ncols;
         1358         Kidinfo*        kidinfos;
         1359         Kidinfo*        nextframeset;
         1360 };
         1361 .EE
         1362 .PP
         1363 .B Next
         1364 is only used if this structure is part of a containing frameset; it points to the next
         1365 element in the list of children of that frameset.
         1366 .B Isframeset
         1367 is set when this structure represents a frameset; if clear, it is an individual frame.
         1368 .PP
         1369 Some fields are used only for framesets.
         1370 .B Rows
         1371 is an array of dimension specifications for rows in the frameset, and
         1372 .B nrows
         1373 is the length of this array.
         1374 .B Cols
         1375 is the corresponding array for columns, of length
         1376 .BR ncols .
         1377 .B Kidinfos
         1378 points to a list of components contained within this frameset, each
         1379 of which may be a frameset or a frame.
         1380 .B Nextframeset
         1381 is only used during parsing, and should be ignored.
         1382 .PP
         1383 The remaining fields are used if the structure describes a frame, not a frameset.
         1384 .B Src
         1385 provides the URL for the document that should be initially loaded into this frame.
         1386 Note that this may be a relative URL, in which case it should be interpreted
         1387 using the containing document's URL as the base.
         1388 .B Name
         1389 gives the name of the frame, typically supplied via a name attribute in the HTML.
         1390 If no name was given, the library allocates one.
         1391 .BR Marginw ,
         1392 .B marginh
         1393 and
         1394 .B framebd
         1395 are the values of the marginwidth, marginheight and frameborder attributes, respectively.
         1396 .B Flags
         1397 can contain some combination of the following:
         1398 .B FRnoresize
         1399 (the frame had the noresize attribute set, and the user should not be allowed to resize it),
         1400 .B FRnoscroll
         1401 (the frame should not have any scroll bars),
         1402 .B FRhscroll
         1403 (the frame should have a horizontal scroll bar),
         1404 .B FRvscroll
         1405 (the frame should have a vertical scroll bar),
         1406 .B FRhscrollauto
         1407 (the frame should be automatically given a horizontal scroll bar if its contents
         1408 would not otherwise fit), and
         1409 .B FRvscrollauto
         1410 (the frame gets a vertical scroll bar only if required).
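Since framesets nest, a Kidinfo list is most naturally processed recursively. The following sketch, offered only as an illustration, counts the individual frames reachable from a Kidinfo. The name countframes is hypothetical and the structure is a cut-down stand-in for the one in <html.h>. It assumes that next is nil for the structure held in Docinfo.kidinfo.

```c
#include <stddef.h>

/* Cut-down stand-in (assumption) holding only the fields used here. */
typedef struct Kidinfo Kidinfo;
struct Kidinfo
{
	Kidinfo*	next;
	int	isframeset;
	Kidinfo*	kidinfos;
};

/*
 * Hypothetical helper: count the frames in the list starting at k,
 * descending into nested framesets through their kidinfos lists.
 */
int
countframes(Kidinfo *k)
{
	int n;

	n = 0;
	for(; k != NULL; k = k->next){
		if(k->isframeset)
			n += countframes(k->kidinfos);
		else
			n++;
	}
	return n;
}
```

A caller could use the result to decide how many subsidiary documents, named by the src fields, must be fetched.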
         1411 .SH SOURCE
         1412 .B \*9/src/libhtml
         1413 .SH SEE ALSO
         1414 .MR fmt (1)
         1415 .PP
         1416 W3C World Wide Web Consortium,
         1417 ``HTML 4.01 Specification''.
         1418 .SH BUGS
         1419 The entire HTML document must be loaded into memory before
         1420 any of it can be parsed.