{
// ====================================================================
//  Copyright (c) 2004, D.Sofyan, Adrian Hafizh & Inge DR.
//  Property of PT SOFTINDO Jakarta.
//  All rights reserved.
//
//  unit cxpos, a class extension (+sample implementations) of
//  expos (extended pos), the extremely high performance unit
//  for SubString/Pattern Search & Replace,
//  using the new, proposed Sofyan-Hafizh BoundCheck algorithm,
//  plus x86 assembler tricks in advanced Delphi programming
//
//  target compiler: Delphi5
//  target CPU: intel 486+ compatible
//              (PentiumTick works only in Pentium+)
//
//  hyperbolic hype:
//  this is the fastest implementation of a pattern search algorithm.
//  the replace function is *at least* 25 times faster than Delphi's
//  standard StringReplace on a very light task, and the ratio rises
//  exponentially with the workload (1000-3000++ times faster
//  on heavy-duty ones). note: the numbers are NOT percentages.
//
//  version: 2.0.1.5
//  based on expos Version: 1.0.2.7 (discontinued) by the same authors,
//  (get expos instead for the fully documented source code)
//  extension:
//    uses classes to be thread safe
//    full-featured Replace function/procedure
//    added PChar version for raw memory handling and file sizes > 2GB
//    added file-based sample implementation
//    ++
//
//  Last update: 2004.12.08
// ====================================================================
//  contacts:
//    zero_inge\\AT\\yahoo\\DOT\\com  ~ should work
//                                      (as long as yahoo is still online)
//  or
//    aa\\AT\\softindo\\DOT\\net      ~ may not work
//    http://delphi.formasi.com       ~ may no longer work
//    http://delphi.softindo.net      ~ does not even work yet
//
//  authors address:
//    Jl. Lima Benua No.23, Ciputat 15411,
//    Banten, INDONESIA
//
//  company address:
//    PT SOFTINDO
//    Jl. Bangka II No.1A,
//    Jakarta 12720, INDONESIA.
// ====================================================================
//
}
//
{
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// credits, benchmark & algorithm ~ excerpted from expos version...
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  ...
  benchmark:
    thanks go to Peter Morris's FastStrings unit (ver-3.2),
    Martin Waldenburg's mwTSearch class (ver-2.2), and Angus
    Johnson's Search component (ver-2.2), whose ideas (and misses)
    have been examined for the benefit of this unit.

    the TmTSearch by Martin Waldenburg appears to be the fastest
    of the three, with one exception as described below.

    the Search component by Angus Johnson is the slowest; however,
    we admit that it is the best we could get in pure Pascal.

    the FastStrings unit by Peter Morris ranks between the two
    above in performance (only a bit slower than the first), and
    is considered the most mature one, with many people involved
    in its development.

    this unit is quite a lot faster than the other implementations.
    for a common search (i.e. case-sensitive, length > 1, arbitrary
    characters with some duplicated), she cuts at least 30~40%
    of the processing time. and if that is not enough, then with the
    case-insensitive search, CharPos and the *ultimate* case &
    repeated-chars detection integrated in this unit, she can go far
    beyond that. (the last one is in effect at *minimum* 3X faster!).

    (that is just a benchmark. of course, it is not fair to
    compare their single-algorithm implementations with this
    acrobatic mixed algorithm).

  algorithm:
    this unit makes use of an algorithm similar to Boyer-Moore's,
    maximizing efficiency based on an understanding of the
    length-integrity-check. the core algorithm is basically similar
    to the algorithm applied by Martin Waldenburg. while Martin's
    implementation is faster than the other two, she still has
    (at least one obvious) gotcha: she fails to catch repeated
    chars, such as 'EEEE' at position (N div 2) * N, where N is
    the length of the pattern to find. when she is forced to give
    the correct result (with the jump altered, as suggested by
    Martin himself), the performance degrades significantly,
    down to that of the Boyer-Moore implementation in Peter
    Morris's FastStrings unit. note, however, that the altered
    one is no longer based on her algorithm.

    nevertheless, even the slowest one is still considered better
    than a *fast* one with a defect.

    actually the algorithm that he used was (perhaps) broken.
    (unfortunately we have no access to the source document itself,
    but according to our examination) it has a subtle flaw in
    picking an arbitrary anchor index (half-pattern length,
    as she called it; we named it boundcheck).
    generally her matching algorithm works ONLY for non-repeated
    chars, i.e. when there are no duplicate characters in the
    SubStr/pattern to be matched.

    (for more details please read our analysis paper on this
    indexed-character-based matching algorithm).
  ...

  end of excerpt.         Reference: Intel's Pentium Developer's Manual
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
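
  illustration (added for orientation, not part of the excerpt):
    a minimal sketch of a textbook Boyer-Moore-Horspool search, i.e.
    the family of algorithms the text above refers to. it is NOT the
    Sofyan-Hafizh BoundCheck algorithm and not code from this unit;
    HorspoolPos and every name in it are illustrative only. because
    it always compares the whole window and shifts by the last
    character of the window, it copes with repeated-char patterns
    such as 'EEEE'. it assumes 8-bit Char, as in Delphi5.

      function HorspoolPos(const SubStr, S: string; StartPos: Integer): Integer;
      var
        Skip: array[Char] of Integer;
        c: Char;
        i, j, m, n: Integer;
      begin
        Result := 0;
        m := Length(SubStr);
        n := Length(S);
        if (m = 0) or (StartPos < 1) or (n - StartPos + 1 < m) then Exit;
        // default shift: the whole pattern length
        for c := Low(Char) to High(Char) do Skip[c] := m;
        // chars that occur in the pattern (except its last) shift less
        for i := 1 to m - 1 do Skip[SubStr[i]] := m - i;
        // i is the index in S aligned with the last char of the pattern
        i := StartPos + m - 1;
        while i <= n do
        begin
          // compare the window backwards, from its last char
          j := 0;
          while (j < m) and (S[i - j] = SubStr[m - j]) do Inc(j);
          if j = m then
          begin
            Result := i - m + 1;
            Exit;
          end;
          // shift according to the char under the window's last position
          Inc(i, Skip[S[i]]);
        end;
      end;

    usage:
      HorspoolPos('EEEE', 'xEEEEE', 1)  // returns 2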
}
{
// ====================================================================
// USAGE:
// ====================================================================

  1. add to the uses clause:
       uses cxpos;

  2. declare a variable of type txSearch:
       var
         tx: txSearch;

  3. create an instance of txSearch:
       tx := txSearch.Create;

     note that the character-based search functions (CharPos &
     CharCount) are already available and do not need object
     creation.

  4. initialize the pattern and case option, i.e. tell the program
     what substring to search for, and whether capitalization
     matters. this can be done at the creation stage, or delayed
     until the actual search/replace is about to proceed.
     we call the latter method IMPLICIT initialization.

     this implicit initialization is quite strict though: the state
     (pattern and case option) will not be changed if it can be
     detected that the pattern to be matched would never be found
     anyway.

     creation and initialization at once:
       tx := txSearch.Create('Some substrings to be searched for');
       tx := txSearch.Create('another pattern', TRUE); // default = FALSE

     note that once initialized, all subsequent searches will
     use the current initialization state.

  5. do the actual search:

       i:= tx.Pos(BigString); // default = start from 1
       i:= tx.Pos(BigString, 1969); // start from position 1969

     as mentioned above, we may delay initialization until the
     actual search. this is also required when we start a new
     pattern to search for (or change its case option):

       i := tx.Pos(BigString, 'pattern', TRUE, 1969);
       i := tx.Pos(BigString, 'pattern2'); // default = start from 1
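
     a minimal sketch tying steps 1-5 together (not taken from the
     unit itself): it counts every occurrence of a pattern by calling
     Pos repeatedly. assumptions: Pos returns 0 when nothing is found
     (like the standard Pos), and CountAll is a hypothetical helper
     name. the instance is released with Free, as for any class.

       function CountAll(const BigString, Pattern: string): integer;
       var
         tx: txSearch;
         i: integer;
       begin
         Result := 0;
         tx := txSearch.Create(Pattern);  // case option left at its default
         try
           i := tx.Pos(BigString);        // first match, from position 1
           while i > 0 do
           begin
             inc(Result);
             // resume one char after the start of the last match,
             // so that overlapping matches are counted too
             i := tx.Pos(BigString, i + 1);
           end;
         finally
           tx.Free;
         end;
       end;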

  6. miscellaneous sample implementations:
       function wordcount, returns count of the pattern
       function wordcountf, returns count of the pattern within a file
       function Replaced, returns the string of Search/Replace
       procedure Replace, simply a Replaced function wrapper
       FindFirst/Next/Close, search within a file
       FileReplace, replace within a file

  7. function prototypes
     a.for searching a pattern
         function(SubStr, S, StartPos, IgnoreCase)

     b.for searching pattern within a file
         FindInFile(SubStr, FileName, StartPos, IgnoreCase, KeepPriorState)

     c.for search and replace:
         function(S, SubStr, Replacement, StartPos, IgnoreCase, ReplaceAll)

     d.for search and replace within a file
       (actually with the same order and types as above):
         function(FileName, SubStr, Replacement, StartPos, IgnoreCase, ReplaceAll)

       required arguments are S (or FileName) and SubStr; the remaining
       arguments are optional, with default values:
         StartPos = 1
         IgnoreCase = FALSE
         CloseFind = FALSE (whether to close the file after the file search)
         ReplaceAll = TRUE

       once initialization has been done, either by calling a function
       with the complete arguments as above or by explicitly calling
       Init, subsequent calls should omit SubStr & IgnoreCase:

       a'.for searching a pattern:
            function(S, StartPos)

       b'.for searching pattern within a file
           FindFirst(FileName, StartPos)
           FindFrom(StartPos, CloseFind)
           FindNext(CloseFind)
           FindClose

       c'.for search and replace:
           function(S, Replacement, StartPos, ReplaceAll)

       d'.for search and replace within a file
           function(FileName, Replacement, StartPos, ReplaceAll)
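
       a hedged sketch of forms c. and c'. side by side. it assumes
       that Replaced is a method of txSearch and that its arguments
       follow exactly the order listed above; check the declarations
       in the unit before relying on it. ReplaceDemo is a hypothetical
       name.

         procedure ReplaceDemo;
         var
           tx: txSearch;
           s: string;
         begin
           tx := txSearch.Create;
           try
             // form c.: full arguments, which also initializes
             // the pattern ('one') and the case option (FALSE)
             s := tx.Replaced('one two one', 'one', '1', 1, FALSE, TRUE);
             // form c'.: the pattern is already initialized, so only
             // S, Replacement, StartPos and ReplaceAll are given
             s := tx.Replaced('one by one', 'uno', 1, FALSE);
           finally
             tx.Free;
           end;
         end;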

       for searching within a file, there is a special group of functions
       (for our convenience), namely FindFirst, FindNext and FindClose.
       the FindFirst function works as in b'. above:

         the CloseFind option means to reset the find-state
         (filename, handles, size, position, etc.) of the
         opened file. the state will not be changed if the
         program can detect early that the pattern to be
         matched would never be found anyway.

         the same rules also apply to replace in file. for
         example, if we try to process a 0-size file, or a file
         whose size < pattern-length, or to replace a pattern by
         itself (with case sensitivity on), the program will not
         blindly carry out our request. since file processing is
         always expensive, the program avoids it wherever she can.

       actually FindFirst is not limited to the first occurrence
       of the pattern, since StartPos may be given anywhere,
       whereas FindNext always gets the next occurrence of the
       pattern (if any) after the previous FindFirst result.

       this group of functions must be properly initialized
       before use.

       otherwise, use FindInFile, which also initializes the
       pattern & case option.

       if the KeepPriorState option is set to TRUE, the previous
       find-state of the FindFirst family (if any) will be
       preserved (this one-shot feature is useful if you have
       to interrupt an ongoing FindFirst-family process without
       disturbing it). otherwise (if it is set to FALSE) the
       previous find-state will be closed and replaced by the
       find-state of the successful FindInFile (if FindInFile
       fails, the previous find-state is not changed).
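
       a hedged sketch of walking through all matches in a file with
       the functions described above. assumptions (none of them
       confirmed by this text): the functions are methods of txSearch,
       they return the match position and 0 when nothing is found,
       and the position fits in an int64. CountInFile is a
       hypothetical name.

         function CountInFile(const FileName, Pattern: string): integer;
         var
           tx: txSearch;
           p: int64;
         begin
           Result := 0;
           tx := txSearch.Create;
           try
             // initializes pattern & case option; KeepPriorState = FALSE
             // so the find-state now belongs to this search
             p := tx.FindInFile(Pattern, FileName, 1, FALSE, FALSE);
             while p > 0 do
             begin
               inc(Result);
               p := tx.FindNext(FALSE);  // CloseFind = FALSE: keep the file open
             end;
             tx.FindClose;               // reset the find-state when done
           finally
             tx.Free;
           end;
         end;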

  -  there are two forms of search: against a String or a pointer.
     use the pointer form to search in memory, a file buffer, a PChar etc.

  -  file-based replace is most effective for a very large file,
     and most importantly when the file size is not expected to
     change, i.e. when the pattern to be searched has the same
     length as its replacement; it should then be much faster than
     using a temporary string & filestream.
     this is not a concern for file search alone, since it opens
     the file in read-only mode. the performance ratio should
     get better as the file size increases.

  -  stop.
}
{
// ====================================================================
   COPYRIGHT NOTICE:

      usage of this program, or part of it, for any purpose, must
      acknowledge the original authors as mentioned above.

   ... 
   we mean it.
// ====================================================================
}
