This component enables the application to parse HTML files and some other
text files. Its main target is HTML in its variations (for example with
ASP or other embedded foreign content). Other formats can be
parsed as well, but with some limitations and additional configuration.
Creation:
ProgID: newObjects.utilctls.HTMLParser
ClassID: {A253C277-F280-4349-B918-ED94BA6A1A28}
free threaded version
ProgID: newObjects.utilctls.HTMLParser.free
ClassID: {F63A8DFD-B830-4f18-8621-22D8D1AFF366}
Contents:
  Members reference 
  The object tree document structure 
  Searching for nodes 
  Cloning/adding elements 
  Remarks 
    What kind of documents can be parsed? 
    What is it for and what is not for? 
    Auto-closing
  HTML Encoding/Decoding
Members reference
  
    | Member | 
    Syntax | 
    Description | 
  
  
     
      Parse | 
    Set result = obj.Parse(string) | 
    Parses the string as configured and
      returns the tree (see below) representing the document.  | 
  
  
     
      Construct | 
    str = obj.Construct(tree [,bXml]) | 
    The reverse of the Parse method. Constructs a
      string from an object tree. | 
  
  
    | Configuration members  | 
  
  
     
      ApplySettings | 
    obj.ApplySettings preset_name | 
    There are a few pre-defined
      configurations which save you the need to go through all the
      configuration properties and members: 
      HTML - default HTML parsing 
      HTMLTEMPLATE - HTML, but only the elements that have an
      attribute named TEMPLATE are parsed. The rest of the content is treated as plain
      text. This works faster and is enough for applications interested in
      modifying only certain elements. 
      ASP - Parses only the ASP related elements and embedded code
      segments (as nodes). The rest is treated as plain text. Good for applications
      which are interested only in separating the code from the static
      content.  
      HTMLASP - HTML + ASP code embeddings. Good for applications interested in
      both the HTML structure and the code. However, the application needs to analyze
      the SCRIPT tags on its own (i.e. check if they have a RUNAT attribute). | 
  
  
     
      AddTag | 
    obj.AddTag tag_name [, autoclose] | 
    Adds a tag to the list of the tags that
      are parsed and specifies if it is a self-closing one (default is 0 - not). | 
  
  
     
      RemoveTag | 
    obj.RemoveTag tag_name | 
    Removes a tag from the list of the
      tags.  | 
  
  
     
      RemoveTags | 
    obj.RemoveTags | 
    Removes all the tags from the parse list. | 
  
  
     
      AddStdHTMLTags | 
    obj.AddStdHTMLTags | 
    Adds all the HTML 3.2 tags to the parse
      list | 
  
  
     
      SetSkipTag | 
    obj.SetSkipTag tag_name,
      bSetRemove | 
    Adds or removes, depending on bSetRemove
      (True - add, False - remove), a tag in the list of the "skip
      tags". The "skip tags" list may contain only non-self-closing
      tags. The content inside these tags is not parsed, i.e. the parser searches
      for the closing tag once the opening tag is found, and the content in between is
      treated as plain text. For instance, the SCRIPT and the STYLE tags are
      skip tags by default in the HTML configuration. | 
  
  
     
      AddEmbed | 
    obj.AddEmbed start,end,name | 
    Adds an embed definition. start and end
      are strings defining how the embedding starts and how it finishes. The
      name is the name of the embedding by which it can be identified in the
      tree (the Info property of a node). The embeddings are supposed to be
      elements that violate the general tag syntax - such as comments, ASP code
      and so on. The internal content in an embed is treated as plain
      text.  | 
  
  
     
      RemoveEmbed | 
    obj.RemoveEmbed name | 
    Removes an embed from the parse list | 
  
  
     
      RemoveEmbeds | 
    obj.RemoveEmbeds | 
    Clears the embed parse list | 
  
  
     
      codePage | 
    obj.CodePage = x 
      x = obj.CodePage | 
    The code page of the parsed
      content.  | 
  
  
      
      caseSensitive | 
    obj.caseSensitive = boolval 
      x = obj.caseSensitive  | 
    Enables/disables case-sensitive parsing.
      For instance, HTML is parsed case-insensitively. | 
  
  
     
      knownTagsOnly | 
    obj.knownTagsOnly = boolval 
      x = obj.knownTagsOnly  | 
    If True only the tags in the list
      (constructed with AddTag or using a pre-set) are parsed - the rest are
      treated as plain text. If False the parser will parse everything that
      looks like a tag assuming it is not self-closing unless it finishes
      with  /> | 
  
  
     
      aspcCompatible | 
    obj.aspcCompatible = boolval 
      x = obj.aspcCompatible | 
    Work in ASP Compiler compatible style.
      This is rarely needed unless you want to use code built for the ASPC
      Compile Time Scripting.  | 
  
  
      
      commentTag | 
    obj.commentTag = string 
      s = obj.commentTag | 
    Sets/returns the comment tag start
      (usually !--). Rarely needed - it is more flexible to use embeds to define
      comments. Still for plain HTML or XML this will work faster. | 
  
  
     
      ignoreUnknownTags | 
    obj.ignoreUnknownTags = boolval 
      x = obj.ignoreUnknownTags | 
    If True any errors while parsing unknown
      tags (knownTagsOnly = False) are ignored and the content is treated as
      plain text. This may help you parse the important part of content which
      includes some elements with tag-like syntax that cannot be understood by
      the parser. | 
  
  
     
      requiredAttribute | 
    obj.requiredAttribute = attr_name 
      x = obj.requiredAttribute | 
    Specifies the name of an attribute that
      must be present for the element to be included in the result tree. If
      empty, all the parsed elements are included. This is used for instance in
      the HTMLTEMPLATE pre-set, where only tags which have a TEMPLATE attribute are
      included in the tree and everything else is treated as plain text no
      matter if it is HTML or not. 
      This is useful when you are not interested in the entire document
      structure, but in certain elements only. | 
  
  
     
      omitEmptyValues | 
    obj.omitEmptyValues = boolval 
      x = obj.omitEmptyValues | 
    If set to True the Construct will not
      output the empty values of attributes. For example if you have NOWRAP
      somewhere it will appear in the output as: 
      NOWRAP if this property is True 
      and as: 
      NOWRAP="" if this property is False 
      Strongly recommended with HTML | 
  
  
     
      skipEmptyTexts | 
    obj.skipEmptyTexts = boolval 
      x = obj.skipEmptyTexts | 
    Default is False. If set to True, the empty
      text/plain areas are not included in the document tree. Note that this
      setting often shapes the way the application works with the tree. For
      instance, an application that relies on this property being True may assume
      that a node contains only attributes and sub-nodes that are HTML elements.
      Such code may not be prepared to see text/plain elements between the
      HTML elements and thus may fail if you decide to set this property to
      False later in the application's development.
       An empty text area is an area consisting only of spaces, <CR>,
      <LF> and tab characters. By default these areas are included in the
      tree as nodes with node.Info = "text/plain" and a single unnamed
      element whose string value contains the actual characters found there (for
      example several spaces, a new line and then some tabs). When the property
      is set to True these elements are not present in the tree. Such elements
      can be considered garbage, yet they reflect the way the original
      document is formatted and may be important for some applications.
      Consider the current and potential future need for
      them before deciding to discard them.  | 
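
The members above can be combined as in the following minimal sketch. It assumes plain VBScript (in an ASP page Server.CreateObject would be used instead of CreateObject) and a variable named content that already holds the document text. The MYTAG tag and the embed named "PI" with its delimiters are illustrative only and are not part of any pre-set.

```vbscript
' A minimal sketch under the assumptions stated above, not a definitive setup.
Dim parser, root, html
Set parser = CreateObject("newObjects.utilctls.HTMLParser")

parser.ApplySettings "HTML"        ' start from the default HTML pre-set
parser.AddTag "MYTAG", 0           ' hypothetical extra tag, not self-closing
parser.AddEmbed "<?", "?>", "PI"   ' hypothetical embed, reported as node.Info = "PI"
parser.omitEmptyValues = True      ' output NOWRAP instead of NOWRAP=""
parser.skipEmptyTexts = True       ' drop whitespace-only text/plain nodes

Set root = parser.Parse(content)   ' content is assumed to be set by the caller
' ... inspect or modify the tree here ...
html = parser.Construct(root)      ' turn the tree back into a string
```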
  
The object tree document structure
The Parse method uses VarDictionary
objects to represent the document's tree. Each node has the following structure:
  node.Info - contains the tag name
  node(name_or_index) - contains an element inside this element
  (sub-node) or an attribute. Thus if it is an object it is a sub-node (element
  inside this element) and if it is not an object it is an attribute.
  node.Key(name_or_index) - is the name of the attribute (if
  node(name_or_index) is not an object) or the ID of the sub-element (if
  node(name_or_index) is object). The ID is the value of the ID attribute if the
  element has such - if it has no ID attribute then its Key is empty. I.e. the
  parser treats the ID attribute in a bit specific manner, but this will not
  ruin anything if you are not interested in the ID.
  The plain text segments - content of a tag, non-parsed parts,
  content of the embeds - are represented as nodes, the same as the nodes for the tags,
  but with node.Info = "text/plain" 
  Thus to change the text somewhere in the document you need to find the
  appropriate text/plain node and change its only value (index 1 or empty name).
  For simplicity you can also set the Root property of the text/plain node
  instead of the first element - this produces the same result when
  you use Construct. However, note that only the Construct method checks the
  Root property and, if it is non-empty, uses it instead of the first element of
  the text/plain node - this exists only to enable a bit simpler
  code for changing texts.   
  The content of the parsed data is encapsulated in a node with Info =
  "document/root"
  To illustrate this let us take this HTML file:
  <HTML>
  <HEAD>
    <TITLE>Some title</TITLE>
  </HEAD>
  <BODY>
    <P ALIGN="LEFT">Some text
      <INPUT TYPE="TEXT" NAME="Text1" VALUE="Something">
    </P>
  </BODY>
  </HTML>
  Its tree will look as shown below. The letter at the start of each line is O or V,
  marking whether the entry is an Object or a Value. The notation reflects
  the expression you would use to access each node of the tree if you have
  called the Parse method like this:
  Set root = o.Parse(content)
  where content is a string containing the above HTML code.
O     root .Info="document/root"
O       root(1) .Info="HTML"
O         root(1)(1) .Info="text/plain"
V           root(1)(1)(1)=""
O         root(1)(2) .Info="HEAD"
O           root(1)(2)(1) .Info="text/plain"
V             root(1)(2)(1)(1) = ""
O           root(1)(2)(2) .Info="TITLE"
O             root(1)(2)(2)(1) .Info="text/plain"
V               root(1)(2)(2)(1)(1)="Some title"
O           root(1)(2)(3) .Info="text/plain"
V             root(1)(2)(3)(1)=""
O         root(1)(3) .Info ="text/plain"
V           root(1)(3)(1)=""
O         root(1)(4) .Info="BODY"
O           root(1)(4)(1) .Info="text/plain"
V             root(1)(4)(1)(1)=""
O           root(1)(4)(2) .Info="P"
V             root(1)(4)(2)(1)="LEFT" or root(1)(4)(2)("ALIGN")="LEFT"
O             root(1)(4)(2)(2) .Info="text/plain"
V               root(1)(4)(2)(2)(1)="Some text"
O             root(1)(4)(2)(3) .Info="INPUT"
V               root(1)(4)(2)(3)(1)="TEXT" or root(1)(4)(2)(3)("TYPE")="TEXT"
V               root(1)(4)(2)(3)(2)="Text1" or root(1)(4)(2)(3)("NAME")="Text1"
V               root(1)(4)(2)(3)(3)="Something" or root(1)(4)(2)(3)("VALUE")="Something"
O             root(1)(4)(2)(4) .Info="text/plain"
V               root(1)(4)(2)(4)(1)=""
O           root(1)(4)(3) .Info="text/plain"
V             root(1)(4)(3)(1)=""
O         root(1)(5) .Info="text/plain"
V           root(1)(5)(1)=""
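
Using the tree above, changing the title text could look like the sketch below. It assumes that root was returned by Parse for the example HTML with the default settings (so the indices match the listing) and that parser is the object that produced it. The variable names are illustrative.

```vbscript
' A sketch, assuming root and parser exist as described above.
Dim titleText, newHtml
Set titleText = root(1)(2)(2)(1)   ' the text/plain node inside TITLE
titleText(1) = "New title"         ' change the node's only value
' Alternative with the same effect on output: titleText.Root = "New title"
newHtml = parser.Construct(root)   ' the generated HTML now carries the new title
```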
  Apart from the attributes, the sub-elements are by default unnamed elements of
  each node. The exception is a sub-element that has an ID attribute. In that
  case it is named after the ID attribute's value in its parent node's
  collection. For example, if we add ID="MyTextBox" to the attributes
  of the INPUT in the above example HTML we will be able to access the text box
  element not only as:
  Set theInput = root(1)(4)(2)(3)
  but also as:
  Set theInput = root(1)(4)(2)("MyTextBox")
  Each language has a way to test whether a certain element of the collection is an
  object or a value. Thus when you need to enumerate only the sub-elements and not
  the attributes of a node you use code like this:
  ' Assuming the above example HTML root(1)(4) is the BODY element.
  For I = 1 To root(1)(4).Count
    If IsObject(root(1)(4)(I)) Then
      ' Do whatever you want to do with each element of the BODY
    End If
  Next
  Of course in real usage you will not specify the indices as numbers
  directly - instead you will search for an element and then dig into it, or
  enumerate the tree and go down the branches you are interested in. 
  When walking the tree you can apply to each node statements like If
  node("NAME") = "Somename" to test the value of
  the NAME attribute of the element (if it is missing, empty is returned).
  Another frequently used expression is, for example, If node(N).Info =
  "A" - once IsObject over node(N) has returned True
  you know that it is a sub-node/element, and node(N).Info gives you the
  element name - an anchor in this sample.
  Still, walking through the whole tree requires too much work - for you and
  for the machine. Thus such techniques are usually applied to small parts of it.
  For example, enumerating the rows <TR> of a <TABLE> is often a
  candidate for this kind of exploration, but only after you get the <TABLE> in
  some more efficient way (it can be deep in the document structure and there
  could be many tables). 
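
The walking technique described above can be sketched as a small recursive routine. WalkTree is a hypothetical helper name, and the output call assumes the script runs under Windows Script Host (in ASP, Response.Write would take its place). Only the documented Count, Info and default item members are used.

```vbscript
' Prints the element names of a (sub)tree, indented by depth.
' A sketch: WalkTree is a hypothetical helper, not part of the component.
Sub WalkTree(node, depth)
  Dim I
  For I = 1 To node.Count
    If IsObject(node(I)) Then                 ' objects are sub-nodes, values are attributes
      If node(I).Info <> "text/plain" Then    ' skip the plain text segments in the listing
        WScript.Echo Space(depth * 2) & node(I).Info
      End If
      WalkTree node(I), depth + 1             ' descend into the sub-node
    End If
  Next
End Sub

' Usage, assuming root was returned by Parse:
WalkTree root, 0
```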
Searching for nodes
  As stated above, the nodes in the document tree are VarDictionary
  objects, over which you can use any of the VarDictionary
  object's methods and properties. There are many possible uses of them, but the
  Find methods deserve special attention.
  Using FindByValue:
  
    This method enables you to find recursively all the sub-nodes of a node
    that have a specific value for a specific attribute. For example in the
    example above we may want to find the INPUT by its name. We can do this:
    Set elements = root.FindByValue("NAME","Text1")
    The returned result is again a VarDictionary
    collection that contains references to all the found elements that match the
    criteria. The search is done in depth, thus we can search for an
    element over the entire tree without the need to dig into a certain branch first. Once
    the element is found we can search for other, more specific elements under it if
    it has sub-elements too. 
    In this example we will have the INPUT object in elements(1).
    Note that FindByValue has optional parameters that specify how many
    elements can be found at most and how deep the search will go - pay
    attention to them in order to refine the search as needed.
    Over the returned result you can loop and inspect each found
    element for something else, perform other actions on some or all of the
    found elements, add new sub-elements to them, or alter and add attributes. For
    example, let's write a little code that adds the ID attribute we used
    above to the INPUT in the above example:
    Set elements = root.FindByValue("NAME","Text1",1,10000)
    For J = 1 To elements.Count
      elements(J)("ID") = "MyTextBox"
    Next
    This code will work even if more than one element has a NAME attribute
    with the value "Text1". An ID="MyTextBox" attribute will be appended
    to all of them. The explicit maximum (10000) is needed because by default
    only the first found element is returned.
    If it is possible that there are other elements with
    NAME="Text1" in the HTML and we want to be sure that only INPUT
    elements of TYPE="TEXT" will be affected we can change the code
    this way:  
    Set elements = root.FindByValue("NAME","Text1",1,10000)
    For J = 1 To elements.Count
      If elements(J).Info = "INPUT" And elements(J)("TYPE") = "TEXT" Then
        elements(J)("ID") = "MyTextBox"
      End If
    Next
  
  Using FindByInfo:
  
    This method searches throughout the tree for elements whose Info property
    contains a specified value. Thus we can use it to find all the elements
    of a certain type. For example, all the paragraphs (P elements):
    Set elements = root.FindByInfo("P",1,10000)
    Usually a combination of both methods is the best way to find what we
    search for. For instance, if we have many tables we name them somehow (there
    are a lot of options - from using an ID="tablename" attribute through
    using a TITLE="name" or NAME="somename" attribute to
    using a custom non-standard HTML attribute for the purpose). No matter which
    option we use, we proceed the same way: find all tables first, then in the
    result search for all the tables with a certain attribute and value. For
    example:
    Set tables = docroot.FindByInfo("TABLE",1,10000)
    Set tablesinquestion = tables.FindByValue("TITLE","DataTable",1,10000)
    This assumes that we have some tables marked with attribute TITLE="DataTable"
    and we like to find them and do something special with each such table. For
    instance we may want to fill the table with some data from a database.
    Another attribute may contain the query or other information which will tell
    the application which data to show in that table. See Adding elements below.
  
  Note that all the FindXXXX methods have optional arguments after the
  criterion arguments. They are, in order: the first found element to return, the maximum number of elements
  to find, and the search depth. By default only the first found element is returned
  in the resulting collection (if any is found, of course). Thus when we expect
  to find more than one element we set a sensible maximum limit, or a very big
  limit to allow all the matching elements to be found. If the HTML is big and
  we want to optimize, the search can be done in batches of several elements by
  changing the first element and the maximum number of elements. The depth is by
  default unlimited, thus it deserves attention only if optimization is possible
  by specifying a lower depth. For example, if we know that the tables we search
  for are not deeper than 5 levels in the document tree, limiting the depth saves
  iterations through the tree, as the FindXXXX method will not look deeper
  than 5 levels at all.
  The found elements are references to the elements in the tree, thus by
  changing some of their attributes or sub-elements we change the tree at the
  location where the element actually is. 
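
A batched search along the lines described above might look like the sketch below. It assumes that the "first found element" argument is a 1-based position among the matches (the examples in this section always pass 1) and that a search with no matches returns an empty collection. The batch size of 50 and the depth of 5 are arbitrary illustrative values.

```vbscript
' A sketch of searching in batches, under the assumptions stated above.
Const BATCH_SIZE = 50
Dim first, found, J
first = 1
Do
  ' criterion, first match to return, maximum matches, search depth
  Set found = root.FindByInfo("TABLE", first, BATCH_SIZE, 5)
  For J = 1 To found.Count
    ' found(J) is a reference to a TABLE node in the tree - process it here
  Next
  first = first + BATCH_SIZE
Loop While found.Count = BATCH_SIZE   ' a short batch means there are no more matches
```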
  The remaining method, FindByName,
  should be used more carefully. In contrast to the other two it may
  return both nodes and attributes in the found collection. Thus if we perform
  Set elements = root.FindByName("TITLE",1,10000) we will receive all
  the TITLE attributes in the document and all the elements that have
  ID="TITLE". This is most often inconvenient unless used over a
  branch where we can be sure that the name searched is used by element IDs and
  no attributes with that name exist. It is recommended to avoid this method
  except in scenarios where the HTML parser is used to provide some template
  functionality with well known and well-formed element IDs.
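
If FindByName is used anyway, the mixed result can be separated with the same IsObject test used for enumerating sub-elements. This is a sketch only; it assumes that attributes appear in the result collection as plain values and nodes as objects, which the text above implies but does not spell out.

```vbscript
' A sketch: keep only the nodes from a FindByName result.
Dim found, I
Set found = root.FindByName("TITLE", 1, 10000)
For I = 1 To found.Count
  If IsObject(found(I)) Then
    ' found(I) is an element whose ID is "TITLE"
  Else
    ' found(I) is the value of a TITLE attribute - ignore it here
  End If
Next
```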
Cloning/adding elements
  One of the nicest features of the VarDictionary
  object is that it can be cloned. Thus in the primary usage of this object,
  HTML templates, we can work with a complete pre-designed HTML. For
  example, in each table that will be filled through the parser we put one row with
  the specified colors, styles etc. Then we can find the table, get the row, remove
  it from the table and then begin to add rows by cloning the row one time for
  each row we want to add. How would this look?
  Assume we have marked the table with a unique ID="ReportTable".
  ' We load thedoccontent from a file, db or somewhere else
  ' and parse it
  Set doc = Parser.Parse(thedoccontent)
  ' We will often need to add text inside elements of the tree.
  ' So it is convenient to create one text/plain node and then
  ' clone it each time we want to add text somewhere
  Set textnode = doc.CreateNew
  textnode.Info = "text/plain"
  ' The same node can be used in all the operations - so if we have more than
  ' the table we discussed we can use the same node as a template. 
  ' Do something else ...
  Set table = doc.FindByValue("ID","ReportTable")(1)
  ' For brevity we refer directly to the first found element.
  ' If it is not there an error (object required) will occur which will
  ' be a good-enough indication that the template is corrupted or wrong.
  ' As this will not happen after the development is finished we can skip the
  ' more precise error checking.
  Set samplerow = table("TemplateRow")
  ' We assume that the sample row has an ID="TemplateRow" attribute
  table.Remove("TemplateRow")
  ' We remove the sample row from the table - we do not want to show empty
  ' unused rows there
  ' Now it is time to generate rows from some data. For the sake of the
  ' example let's suppose we have an SQLite database and we query it
  Set data = db.Execute("SELECT * FROM SomeTable")
  For r = 1 To data.Count
    Set row = samplerow.Clone
    ' we create a clone of the sample row we got earlier
    ' now let's make our work even simpler - suppose we marked each <TD>
    ' with attribute FIELD="somefieldname" to indicate which field from the
    ' data set obtained from the database should be put in that cell.
    For c = 1 To data(r).Count
      Set cell = row.FindByValue("FIELD",data(r).Key(c))(1)
      Set text = textnode.Clone
      text.Root = data(r)(c)
      cell.Add "", text
    Next
    ' put the filled row into the table
    table.Add "", row
  Next
  This code works fine even if we have one or more heading rows in the
  table. We achieved that by giving a specific ID to the row that is used
  as a template for all the rows we want to list in the table. This way it has a
  name in the table's node and can be removed directly by name, with no need to
  search for it or to assume anything that may change if the template design changes.
  Certainly there are other ways to do the same, and sometimes there will be
  more efficient or convenient ways to deal with similar tasks.
  We also helped ourselves by putting a FIELD attribute in the cells. We can
  do without it if we know the number of the cells and the data set, and we know
  exactly what to put in which cell by index or by another criterion. 
  When pre-defined chunks of HTML are needed often, it is useful to create a
  function that clones a certain template element and fills it with the specific
  data from the function arguments.
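
Such a function could be sketched as follows. MakeLink and its parameter names are hypothetical. The sketch assumes that templateAnchor is an A element taken from a parsed template and containing no text of its own, and that textTemplate is an empty text/plain node created with CreateNew.

```vbscript
' A sketch of a template-filling helper (hypothetical name and parameters).
Function MakeLink(templateAnchor, textTemplate, url, caption)
  Dim link, text
  Set link = templateAnchor.Clone   ' copy the element with its attributes and styles
  link("HREF") = url                ' set (or add) the HREF attribute
  Set text = textTemplate.Clone     ' fresh text/plain node for the caption
  text.Root = caption
  link.Add "", text                 ' append the caption as an unnamed sub-node
  Set MakeLink = link
End Function

' Usage, assuming cell is a TD node and the two templates exist:
' cell.Add "", MakeLink(anchorTemplate, textnode, "details.asp?id=1", "Details")
```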
  As you may have guessed, the Clone method clones not only the node but
  also all its contents. Thus Clone allows us to copy part of the tree and put
  it somewhere else. Removing a node from the tree will not remove it from
  memory if we have already saved it in a variable. This way we can take something we
  want to copy many times from the document, remove it from its original
  location and begin to clone it and put it wherever we need it.
  CreateNew
  - we used it above to create a plain text node. Why not create a VarDictionary
  directly (using Server.CreateObject ....)? CreateNew creates an empty
  VarDictionary with the same features as the one over which it is invoked. Thus by
  using CreateNew over any node we are sure that the new object is configured
  with the same behavior as the node over which the method has been called.
  VarDictionary allows behavior configuration, and the best approach is to
  ensure that all the nodes in the tree have the same behavior. Using
  CreateNew we copy the behavior without the need to manually adjust it to match
  the other nodes. Furthermore, this ensures that the object is created
  directly and faster, in the same COM apartment as the other node (this may be
  important in some applications). 
  It is often a good idea to set the skipEmptyTexts property of the
  parser to True and thus strip from the document tree all the hollow text
  elements not relevant to the document structure. For instance, one such element
  appears between any two sibling nodes if the document is formatted with
  new lines after the tags and so on. This garbage can amount to
  more than half of the nodes in the tree (depending on the document
  formatting). However, be aware that as a result the Construct method
  will output HTML with no new lines between the tags, which is not very
  readable for humans.
Remarks
  What kind of documents can be parsed?
  With appropriate configuration the parser can be used with most HTML and
  XML documents. Note that some of the properties may need some tuning if an
  error occurs. A good decision is to set knownTagsOnly to True in order to
  ignore anything the parser cannot understand.
  The parser can work with documents that use the corresponding Windows code
  page encoding, but it is not guaranteed to work with, for example, UNICODE or UTF-8 HTML/XML
  documents. The newer Windows versions support a UTF-8 code page, but
  the parser does not perform any handling of its own, so
  using it makes the application dependent on the Windows versions that have
  the feature.
  The output can be translated to another code page, but this rarely makes
  sense except for templates that do not contain constant texts (translating the
  code page does not translate the text from one language to another - the
  result is something unreadable in its place). When doing this, do not
  forget to change or add the META tag that marks the document encoding.
  What is it for and what is not for?
  The parser can be used on arbitrary pages in a known language, for example for
  indexing them. Also, by calling it twice - first to find only the META encoding
  tag and then to parse the whole document with the appropriate code page - you can
  cope with several different languages and encodings, but not all. Thus search
  engine like usage is somewhat limited, but is enough for a wide variety of
  purposes - such as intranets, well known sites and so on. 
  Although some configuration is needed almost any correct HTML or XML can be
  parsed and re-generated back with some changes. Most often it is possible to
  parse the documents partially - by specifying only some tags or embeds that
  are actually of interest and treat the rest as plain text. Depending on the
  usage you can optimize the performance by configuring the parser to work only
  with the elements you actually need. Obviously for applications with extensive
  parsing the performance improvements can be drastic.
  The object is named "Light" because it uses VarDictionary for the
  document tree, thus avoiding the need to supply specialized objects for the
  document tree. This is a plus and a minus. Positives: fewer objects to learn,
  no matter if the document is HTML, XML or even something else with some HTML/XML
  like inclusions. Negatives: the object model is nothing like the XML DOM or DHTML
  - only the structure of the tree is the same, but you work with universal
  objects which do not have members named after the HTML or XML standards. 
  Auto-closing  
  The parser uses auto-closing of tags. If a closing tag is found that
  does not correspond to the last open tag, the operation does
  not fail. Instead the last open tag is closed automatically, then the
  closing tag is compared to the open tag that contains it, and so
  on until a match is found. Thus:
  <TABLE>
  <TR>
  <TD>
    <P> something
  </TD>
  <TD>
    <P> something
  </TD>
  </TABLE>
  will appear as intended in the document tree:
  TABLE
    TR
      TD
        P
      TD
        P
  
  This is how most browsers behave today; the parser's handling of
  incorrect HTML follows the typical browser behavior. 
  HTML Encoding/Decoding
  The parser does not decode or encode any texts from/to the document tree.
  Thus when you need to HTML decode or encode certain texts you must use an HTML
  Encoder object to perform Encode/Decode. One such object is usually enough
  for the entire application - you call it wherever you need it.