This component lets an application parse HTML files and some other text
formats. Its main target is HTML in its common variations (for example with ASP
or other embedded foreign content). Other formats can be parsed as well, but
with some limitations and with additional configuration.
Creation:
ProgID: newObjects.utilctls.HTMLParser
ClassID: {A253C277-F280-4349-B918-ED94BA6A1A28}
free threaded version
ProgID: newObjects.utilctls.HTMLParser.free
ClassID: {F63A8DFD-B830-4f18-8621-22D8D1AFF366}
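A minimal creation sketch in VBScript, assuming the component is registered on the machine and the script runs under Windows Script Host (in classic ASP, Server.CreateObject would take the place of CreateObject). The variable names and the sample string are illustrative only.

```vbscript
' Create the parser by ProgID (use the ".free" ProgID for the free threaded version)
Dim parser, root
Set parser = CreateObject("newObjects.utilctls.HTMLParser")
' Load the pre-defined HTML configuration before parsing
parser.ApplySettings "HTML"
' Parse returns the root node of the document tree
Set root = parser.Parse("<P ALIGN=""LEFT"">Hello</P>")
WScript.Echo root.Info
```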
Contents:
Members reference
The object tree document structure
Searching for nodes
Cloning/adding elements
Remarks
What kind of documents can be parsed?
What is it for and what is not for?
Auto-closing
HTML Encoding/Decoding
Members reference
Member |
Syntax |
Description |
Parse |
Set result = obj.Parse(string) |
Parses the string as configured and
returns the tree (see below) representing the document. |
Construct |
str = obj.Construct(tree [,bXml]) |
The reverse of the Parse method. Constructs a
string from an object tree. |
Configuration members |
ApplySettings |
obj.ApplySettings preset_name |
There are a few pre-defined
configurations that save you from going through all the
configuration properties and members one by one:
HTML - default HTML parsing.
HTMLTEMPLATE - HTML, but only the elements that have an
attribute named TEMPLATE are parsed. The rest of the content is treated as plain
text. This works faster and is enough for applications interested in
modifying only certain elements.
ASP - parses only the ASP-related elements and embedded code
segments (as nodes). The rest is treated as plain text. Good for applications
that are interested only in separating the code from the static
content.
HTMLASP - HTML plus ASP code embeddings. Good for applications interested in
both the HTML structure and the code. However, the application needs to analyze
the SCRIPT tags on its own (i.e. check whether they have a RUNAT attribute). |
AddTag |
obj.AddTag tag_name [, autoclose] |
Adds a tag to the list of the tags that
are parsed and specifies if it is a self-closing one (default is 0 - not). |
RemoveTag |
obj.RemoveTag tag_name |
Removes a tag from the list of the
tags. |
RemoveTags |
obj.RemoveTags |
Removes all the tags from the parse list. |
AddStdHTMLTags |
obj.AddStdHTMLTags |
Adds all the HTML 3.2 tags to the parse
list |
SetSkipTag |
obj.SetSkipTag tag_name, bSetRemove |
Adds a tag to, or removes it from, the list of "skip
tags", depending on bSetRemove (True - add, False - remove). The
"skip tags" list may contain only non-self-closing
tags. The content inside such tags is not parsed: once the parser finds the
opening tag it searches for the closing tag and treats everything in between
as plain text. For instance, the SCRIPT and STYLE tags are
skip tags by default in the HTML configuration. |
AddEmbed |
obj.AddEmbed start,end,name |
Adds an embed definition. start and end
are strings defining how the embedding starts and how it finishes. The
name is the name of the embedding by which it can be identified in the
tree (the Info property of a node). The embeddings are supposed to be
elements that violate the general tag syntax - such as comments, ASP code
and so on. The internal content in an embed is treated as plain
text. |
RemoveEmbed |
obj.RemoveEmbed name |
Removes an embed from the parse list |
RemoveEmbeds |
obj.RemoveEmbeds |
Clears the embed parse list |
codePage |
obj.CodePage = x
x = obj.CodePage |
The code page of the parsed
content. |
caseSensitive |
obj.caseSensitive = boolval
x = obj.caseSensitive |
Enables or disables case-sensitive parsing.
For instance, HTML is parsed case-insensitively. |
knownTagsOnly |
obj.knownTagsOnly = boolval
x = obj.knownTagsOnly |
If True only the tags in the list
(constructed with AddTag or using a pre-set) are parsed - the rest are
treated as plain text. If False the parser will parse everything that
looks like a tag assuming it is not self-closing unless it finishes
with /> |
aspcCompatible |
obj.aspcCompatible = boolval
x = obj.aspcCompatible |
Work in ASP Compiler compatible style.
This is rarely needed unless you want to use code built for the ASPC
Compile Time Scripting. |
commentTag |
obj.commentTag = string
s = obj.commentTag |
Sets/returns the comment tag start
(usually !--). Rarely needed - it is more flexible to use embeds to define
comments. Still for plain HTML or XML this will work faster. |
ignoreUnknownTags |
obj.ignoreUnknownTags = boolval
x = obj.ignoreUnknownTags |
If True any errors while parsing unknown
tags (knownTagsOnly = False) are ignored and the content is treated as
plain text. This may help you parse the important part of content which
includes some elements with tag-like syntax that cannot be understood by
the parser. |
requiredAttribute |
obj.requiredAttribute = attr_name
x = obj.requiredAttribute |
Specifies the name of an attribute that
must be present for an element to be included in the result tree. If
empty, all the parsed elements are included. This is used for instance in
the HTMLTEMPLATE pre-set, where only tags that have a TEMPLATE attribute are
included in the tree and everything else is treated as plain text no
matter if it is HTML or not.
This is useful when you are not interested in the entire document
structure, but in certain elements only. |
omitEmptyValues |
obj.omitEmptyValues = boolval
x = obj.omitEmptyValues |
If set to True the Construct will not
output the empty values of attributes. For example if you have NOWRAP
somewhere it will appear in the output as:
NOWRAP if this property is True
and as:
NOWRAP="" if this property is False
Strongly recommended with HTML |
skipEmptyTexts |
obj.skipEmptyTexts = boolval
x = obj.skipEmptyTexts |
Default: False. If set to True, empty
text/plain areas are not included in the document tree. Note that this
setting is often tied to the way the application works with the tree. For
instance, an application that depends on this property being True may assume
that a node contains only attributes and sub-nodes that are HTML elements.
Such code may not be prepared to see text/plain elements between the
HTML elements and thus may fail if you decide to set this property to
False later in the application's development.
An empty text area is an area consisting only of spaces, <CR>,
<LF> and tab characters. By default these areas are included in the
tree as nodes with node.Info = "text/plain" and a single unnamed
element whose string value contains the actual characters found there (for
example several spaces, a new line and then some tabs). When the property
is set to True these elements are not present in the tree. Such elements
can be considered garbage, yet they reflect the way the original
document is formatted and may be important for some applications.
Consider carefully whether you need them (now or potentially in the future)
before deciding to discard them. |
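The configuration members above can be combined with a preset. The sketch below only illustrates the documented call shapes: the tag name MYBREAK, the embed name "PHP" and its delimiters are hypothetical, and the value 1 for autoclose is an assumption based on the documented default of 0.

```vbscript
Dim parser, tree, html
Set parser = CreateObject("newObjects.utilctls.HTMLParser")
parser.ApplySettings "HTML"            ' start from the HTML preset
parser.AddTag "MYBREAK", 1             ' hypothetical self-closing custom tag (1 is assumed)
parser.AddEmbed "<?", "?>", "PHP"      ' hypothetical embed; the name shows up in node.Info
parser.SetSkipTag "TEXTAREA", True     ' treat the content of TEXTAREA as plain text
parser.skipEmptyTexts = True           ' drop whitespace-only text/plain nodes
parser.omitEmptyValues = True          ' output NOWRAP instead of NOWRAP=""
Set tree = parser.Parse("<P NOWRAP>Some text <? echo 1; ?></P>")
html = parser.Construct(tree)          ' turn the tree back into a string
```

Starting from a preset and adjusting a few members is usually shorter than building the tag and embed lists from scratch with RemoveTags and RemoveEmbeds.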
The object tree document structure
The Parse method uses VarDictionary
objects to represent the document's tree. Each node has the following structure:
node.Info - contains the tag name
node(name_or_index) - contains an element inside this element
(sub-node) or an attribute. Thus if it is an object it is a sub-node (element
inside this element) and if it is not an object it is an attribute.
node.Key(name_or_index) - is the name of the attribute (if
node(name_or_index) is not an object) or the ID of the sub-element (if
node(name_or_index) is an object). The ID is the value of the element's ID
attribute; if the element has no ID attribute its Key is empty. In other words, the
parser gives the ID attribute special treatment, but this does no
harm if you are not interested in the ID.
The plain text segments (content of a tag, non-parsed parts,
content of the embeds) are represented as nodes just like the nodes for the tags,
but with node.Info = "text/plain".
Thus to change the text somewhere in the document you need to find the
appropriate text/plain node and change its only value (index 1 or empty name).
For simplicity you can also set the Root property of the text/plain node
instead of the first element - this produces the same result when
you use Construct. However, note that only the Construct method checks the
Root property and, if it is non-empty, uses it instead of the first element of
the text/plain node. This exists only to let you use slightly simpler
code for changing texts.
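As a small sketch of changing a text through the Root property, assume a one-element document so that the indices are easy to follow. The sample string and variable names are illustrative; the string is rebuilt with the Construct method from the members reference.

```vbscript
Dim parser, root, textNode
Set parser = CreateObject("newObjects.utilctls.HTMLParser")
parser.ApplySettings "HTML"
Set root = parser.Parse("<TITLE>Old title</TITLE>")
' root(1) is the TITLE element, root(1)(1) is its text/plain node
Set textNode = root(1)(1)
If textNode.Info = "text/plain" Then
    textNode.Root = "New title"   ' alternative: textNode(1) = "New title"
End If
WScript.Echo parser.Construct(root)
```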
The content of the parsed data is encapsulated in a node with Info =
"document/root"
To illustrate this let us take this HTML file:
<HTML>
<HEAD>
<TITLE>Some title</TITLE>
</HEAD>
<BODY>
<P ALIGN="LEFT">Some text
<INPUT TYPE="TEXT" NAME="Text1"
VALUE="Something">
</P>
</BODY>
</HTML>
Its tree will look as shown below. The letter at the start of each line is O or V
and marks whether the item is an Object or a Value. The notation reflects
the expression you would use to access each node of the tree if you have
called the Parse method like this:
Set root = o.Parse(content)
where content is a string containing the above HTML code.
O root .Info="document/root"
O root(1) .Info="HTML"
O root(1)(1) .Info="text/plain"
V root(1)(1)(1)=""
O root(1)(2) .Info="HEAD"
O root(1)(2)(1) .Info="text/plain"
V root(1)(2)(1)(1) = ""
O root(1)(2)(2) .Info="TITLE"
O root(1)(2)(2)(1) .Info="text/plain"
V root(1)(2)(2)(1)(1)="Some title"
O root(1)(2)(3) .Info="text/plain"
V root(1)(2)(3)(1)=""
O root(1)(3) .Info ="text/plain"
V root(1)(3)(1)=""
O root(1)(4) .Info="BODY"
O root(1)(4)(1) .Info="text/plain"
V root(1)(4)(1)(1)=""
O root(1)(4)(2) .Info="P"
V root(1)(4)(2)(1)="LEFT" or root(1)(4)(2)("ALIGN")="LEFT"
O root(1)(4)(2)(2) .Info="text/plain"
V root(1)(4)(2)(2)(1)="Some text"
O root(1)(4)(2)(3) .Info="INPUT"
V root(1)(4)(2)(3)(1)="TEXT" or root(1)(4)(2)(3)("TYPE")="TEXT"
V root(1)(4)(2)(3)(2)="Text1" or root(1)(4)(2)(3)("NAME")="Text1"
V root(1)(4)(2)(3)(3)="Something" or root(1)(4)(2)(3)("VALUE")="Something"
O root(1)(4)(2)(4) .Info="text/plain"
V root(1)(4)(2)(4)(1)=""
O root(1)(4)(3) .Info="text/plain"
V root(1)(4)(3)(1)=""
O root(1)(5) .Info="text/plain"
V root(1)(5)(1)=""
By default, unlike the attributes, the sub-elements are unnamed elements of
each node. The only exception is a sub-element that has an ID attribute: in that
case it is named after the ID attribute's value in its parent's node
collection. For example, if we add ID="MyTextBox" to the attributes
of the INPUT in the above example HTML, we will be able to access the text box
element not only as:
Set theInput = root(1)(4)(2)(3)
but also as:
Set theInput = root(1)(4)(2)("MyTextBox")
Every scripting language has a way to test whether a certain element of the
collection is an object or a value. Thus when you need to enumerate only the
sub-elements and not the attributes of a node you can use code like this:
' Assuming the above example HTML, root(1)(4) is the BODY element.
For I = 1 To root(1)(4).Count
    If IsObject(root(1)(4)(I)) Then
        ' Do whatever you want to do with each element of the BODY
    End If
Next
Of course, in real usage you will not specify the indices as numbers
directly. Instead you will search for an element and then dig into it, or
enumerate the tree and go down the branches you are interested in.
When walking the tree you can apply statements like If
node("NAME") = "Somename" to a node to check the value of
the NAME attribute of the element (if the attribute is missing, empty is returned).
Another frequently used expression is, for example, If node(N).Info =
"A". Once IsObject(node(N)) has returned True you know
that node(N) is a sub-node/element, and node(N).Info gives you the
element name - an anchor in this sample.
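The walking technique described above can be sketched as a recursive routine. DumpNode is a hypothetical helper name. It uses only Count, Info, Key and the default indexer shown in this section, and assumes output through WScript.Echo under Windows Script Host.

```vbscript
' Prints a node, its attributes and, recursively, its sub-nodes
Sub DumpNode(node, indent)
    Dim i
    WScript.Echo indent & "[" & node.Info & "]"
    For i = 1 To node.Count
        If IsObject(node(i)) Then
            ' An object is a sub-node: descend into it
            DumpNode node(i), indent & "  "
        Else
            ' A value is an attribute (or the unnamed text of a text/plain node)
            WScript.Echo indent & "  " & node.Key(i) & " = " & node(i)
        End If
    Next
End Sub

' Usage, with root being the tree returned by Parse:
' DumpNode root, ""
```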
Still, walking through the whole tree requires too much work, both for you and
for the machine. Such techniques are therefore usually applied to small parts of it.
For example, enumerating the rows <TR> of a <TABLE> is often a good
candidate for this kind of exploration, but only after you have obtained the
<TABLE> in some more efficient way (it can be deep in the document structure
and there could be many tables).
Searching for nodes
As stated above, the nodes in the document tree are VarDictionary
objects, so you can use any of the VarDictionary
object's methods and properties on them. There are many possible uses for them,
but the Find methods deserve special attention.
Using FindByValue:
This method lets you recursively find the sub-nodes of a node
that have a specific value for a specific attribute. For example, in the
example above we may want to find the INPUT by its name. We can do this:
Set elements = root.FindByValue("NAME","Text1")
The returned result is again a VarDictionary
collection that contains references to the found elements that match the
criteria. The search is done in depth, thus we can search
the entire tree without needing to dig into a certain branch first. Once
the element is found we can search for other, more specific elements under it if
it has sub-elements too.
In this example we will have the INPUT object in elements(1).
Note that FindByValue has optional parameters that specify how many
elements can be found at most and how deep the search will go - pay
attention to them in order to refine the search as needed.
You can loop over the returned result and inspect each found
element for something else, perform other actions on some or all of the
found elements, add new sub-elements to them, or alter or add attributes. For
example, let's write a little code that adds the ID attribute we used
above to the INPUT in the above example:
Set elements = root.FindByValue("NAME","Text1",1,10000)
For J = 1 To elements.Count
    elements(J)("ID") = "MyTextBox"
Next
This code works even if more than one element has a NAME attribute
with value "Text1", because the maximum number of elements to find is set
to a large value (by default only the first match is returned). An
ID="MyTextBox" attribute is added to all of them.
If it is possible that there are other elements with
NAME="Text1" in the HTML and we want to be sure that only INPUT
elements of TYPE="TEXT" are affected, we can change the code
this way:
Set elements = root.FindByValue("NAME","Text1",1,10000)
For J = 1 To elements.Count
    If elements(J).Info = "INPUT" And elements(J)("TYPE") = "TEXT" Then
        elements(J)("ID") = "MyTextBox"
    End If
Next
Using FindByInfo:
This method searches throughout the tree for elements whose Info property
contains a specified value. Thus we can use it to find all the elements
of a certain type, for example all the paragraphs (P elements).
Set elements = root.FindByInfo("P",1,10000)
Usually a combination of both methods is the best way to find what we
are looking for. For instance, if we have many tables we can name them somehow.
There are a lot of options: using an ID="tablename" attribute,
using a TITLE="name" or NAME="somename" attribute, or
using a custom non-standard HTML attribute for the purpose. No matter which
attribute we use, the approach is the same: find all tables first, then
search the result for the tables with a certain attribute and value. For
example:
Set tables = docroot.FindByInfo("TABLE",1,10000)
Set tablesinquestion = tables.FindByValue("TITLE","DataTable",1,10000)
This assumes that we have some tables marked with the attribute TITLE="DataTable"
and we want to find them and do something special with each such table. For
instance, we may want to fill the table with some data from a database.
Another attribute may contain the query or other information that tells
the application which data to show in that table. See Cloning/adding elements below.
Note that all the FindXXXX methods have optional arguments after the
criterion arguments. They are, in order: the first found element to return, the
maximum number of elements to find, and the search depth. By default only the
first found element is returned in the resulting collection (if any is found,
of course). Thus when we expect to find more than one element we set a sensible
maximum or a very large limit to allow all the matching elements to be found.
If the HTML is big and we want to optimize, the search can be done in batches
of several elements by changing the first element and the maximum number of
elements. The depth is unlimited by default, so it deserves attention only if
optimization is possible by specifying a lower depth. For example, if we know
that the tables we are searching for are not deeper than 5 levels in the
document tree, limiting the depth saves iterations through the tree, because
the FindXXXX method will not look deeper than 5 levels at all.
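The batching idea can be sketched as follows. ListLinks and the batch size are illustrative names and values, and the sketch assumes the optional arguments behave exactly as described above (first match to return, maximum matches, then depth).

```vbscript
Const BATCH = 100

' Lists the HREF of every A element, fetching the matches in batches
Sub ListLinks(root)
    Dim first, links, i
    first = 1
    Do
        ' Return at most BATCH matches, starting with match number "first"
        Set links = root.FindByInfo("A", first, BATCH)
        For i = 1 To links.Count
            WScript.Echo links(i)("HREF")
        Next
        first = first + links.Count
    Loop While links.Count = BATCH   ' a short batch means there are no more matches
End Sub

' Limiting the depth as well, e.g. tables no deeper than 5 levels:
' Set tables = root.FindByInfo("TABLE", 1, 10000, 5)
```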
The found elements are references to the elements in the tree, thus by
changing some of their attributes or sub-elements we change the tree at the
location where the element actually is.
The other method, FindByName,
should be used more carefully. In contrast to the other two it may
return both nodes and attributes in the found collection. Thus if we perform
Set elements = root.FindByName("TITLE",1,10000) we will receive all
the TITLE attributes in the document and all the elements that have
ID="TITLE". This is most often inconvenient unless used over a
branch where we can be sure that the name searched for is used only by element
IDs and no attributes with that name exist. It is recommended to avoid this method
except in scenarios where the HTML parser is used to provide some template
functionality with well-known and well-formed element IDs.
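If FindByName has to be used on a mixed branch, the two kinds of results can be told apart with IsObject, as in this sketch. ListByName is a hypothetical helper name.

```vbscript
' Separates the elements from the attributes returned by FindByName
Sub ListByName(root, name)
    Dim found, i
    Set found = root.FindByName(name, 1, 10000)
    For i = 1 To found.Count
        If IsObject(found(i)) Then
            ' An object: an element whose ID equals the searched name
            WScript.Echo "element " & found(i).Info & " with ID=" & name
        Else
            ' A value: an attribute with the searched name
            WScript.Echo "attribute " & name & " = " & found(i)
        End If
    Next
End Sub
```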
Cloning/adding elements
One of the nicest features of the VarDictionary
object is that it can be cloned. Thus in the primary usage of this object,
HTML templates, we can work from a complete pre-designed HTML page. For
example, we put in each table that will be filled by the application one row with
the specified colors, styles etc. Then we can find the table, get the row, remove
it from the table and then add rows by cloning the sample row once for
each row we want to add. What does this look like?
Assume we have marked the table with a unique ID="ReportTable".
' We load thedoccontent from a file, db or somewhere else
' and parse it
Set doc = Parser.Parse(thedoccontent)
' We will often need to add text inside elements of the tree,
' so it is convenient to create one text/plain node and then
' clone it each time we want to add text somewhere
Set textnode = doc.CreateNew
textnode.Info = "text/plain"
' The same node can serve as a template in all the operations, not only
' for the table we discuss here.
' Do something else ...
Set table = doc.FindByValue("ID","ReportTable")(1)
' For brevity we refer directly to the first found element.
' If it is not there an error (object required) will occur, which is
' a good-enough indication that the template is corrupted or wrong.
' As this will not happen after the development is finished we can skip
' more precise error checking.
Set samplerow = table("TemplateRow")
' We assume that the sample row has an ID="TemplateRow" attribute
table.Remove("TemplateRow")
' We remove the sample row from the table - we do not want to show an empty
' unused row there.
' Now it is time to generate rows from some data. For the sake of the
' example let's suppose we have an SQLite database and we query it
Set data = db.Execute("SELECT * FROM SomeTable")
For r = 1 To data.Count
    Set row = samplerow.Clone
    ' We create a clone of the sample row we got earlier.
    ' Now let's make our work even simpler - suppose we marked each <TD>
    ' with an attribute FIELD="somefieldname" to indicate which field from the
    ' data set obtained from the database should be put in that cell.
    For c = 1 To data(r).Count
        Set cell = row.FindByValue("FIELD",data(r).Key(c))(1)
        Set text = textnode.Clone
        text.Root = data(r)(c)
        cell.Add "", text
    Next
    ' Append the filled row to the table as an unnamed sub-element
    table.Add "", row
Next
This code works fine even if we have one or more heading rows in the
table. We achieved that by setting a specific ID on the row that must be used
as a template for all the rows we want to list in the table. This way it has a
name in the table's node and can be removed directly by name, with no need to
search for it or to assume anything that may change if the template design changes.
Certainly there are other ways to do the same, and sometimes there will be
more efficient or convenient ways to deal with similar tasks.
We also helped ourselves by putting a FIELD attribute in the cells. We can
do without it if we know the number of cells and the data set, and we know
exactly what to put in which cell by index or by another criterion.
When pre-defined chunks of HTML are needed often, it is useful to create a
function that clones a certain template element and fills it with the specific
data from the function arguments.
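Such a helper function might look like the sketch below. MakeLink and its parameters are hypothetical names. The sketch assumes the template is an A element without text of its own, and that textTemplate is a text/plain node created with CreateNew as shown earlier.

```vbscript
' Returns a filled copy of a template link element
Function MakeLink(templateLink, textTemplate, url, caption)
    Dim link, text
    Set link = templateLink.Clone     ' copies the element with all its attributes
    link("HREF") = url                ' set or replace the HREF attribute
    Set text = textTemplate.Clone     ' fresh text/plain node for the caption
    text.Root = caption
    link.Add "", text                 ' append the caption as an unnamed sub-node
    Set MakeLink = link
End Function

' Usage: cell.Add "", MakeLink(sampleLink, textnode, "page.asp", "Details")
```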
As you may have guessed, the Clone method clones not only the node but
also all its contents. Thus Clone allows us to copy part of the tree and put
it somewhere else. Removing a node from the tree will not remove it from
memory if we have already saved it in a variable. This way we can take something we
want to copy many times from the document, remove it from its original
location and then clone it and put the clones wherever we need them.
CreateNew
- we used it above to create a plain text node. Why not create a VarDictionary
directly (using Server.CreateObject ....)? CreateNew creates an empty
VarDictionary with the same features as the one on which it is invoked. Thus by
calling CreateNew on any node we are sure that the new object is configured
with the same behavior as the node on which the method was called.
VarDictionary allows behavior configuration, and the best approach is to
ensure that all the nodes in the tree have the same behavior. By using
CreateNew we copy the behavior without the need to manually adjust it to match
the other nodes. Furthermore, this ensures that the object is created
directly and faster in the same COM apartment as the other node (this may be
important in some applications).
Remarks
It is often a good idea to set the skipEmptyTexts property of the
parser to True and thus strip from the document tree all the hollow text
elements not relevant to the document structure. For instance, one such element
appears between any two sibling nodes if the document is formatted with
new lines after the tags. This is a lot of garbage, which can amount to
more than half of the nodes in the tree (depending on the document
formatting). However, be aware that as a result the Construct method
will output HTML with no new lines between the tags, which is not very
human-readable.
What kind of documents can be parsed?
With appropriate configuration the parser can be used with most HTML and
XML documents. Note that some of the properties may need tuning if an
error occurs. A good choice is to set knownTagsOnly to True in order to
ignore anything the parser cannot understand.
The parser can work with documents that use the corresponding Windows code
page encoding, but it is not guaranteed to work with UNICODE or UTF-8 HTML/XML
documents, for example. The newer Windows versions support a UTF-8 code page, but
you should know that the parser does not perform any UTF-8 handling of its own, so
using it makes the application dependent on the Windows versions that have
the feature.
The output can be translated to another code page, but this rarely makes
sense except for templates that do not contain constant texts (translating the
code page will not translate the text from one language to another - the
result will be something unreadable in its place). When doing this do not
forget to change or add the META tag that marks the document encoding.
What is it for and what is not for?
The parser can be used on arbitrary pages in a known language, for example for
indexing them. Also, by calling it twice - first to find only the META encoding
tag and then to parse the whole document with the appropriate code page - you can
cope with several different languages and encodings, but not all. Thus
search-engine-like usage is somewhat limited, but is sufficient for a wide
variety of purposes such as intranets, well-known sites and so on.
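The first of the two passes might be sketched as below. DetectCharset is a hypothetical helper. The sketch assumes that tag and attribute names are matched regardless of case (or are written in upper case in the page), that 1 marks a self-closing tag in AddTag, and that the caller maps the returned charset name to a Windows code page.

```vbscript
' Pass 1: parse META tags only and return the declared charset ("" if none)
Function DetectCharset(scanner, content)
    Dim root, metas, i, value, pos
    DetectCharset = ""
    scanner.RemoveTags
    scanner.AddTag "META", 1          ' 1 = self-closing (assumed value)
    scanner.knownTagsOnly = True      ' everything else stays plain text
    Set root = scanner.Parse(content)
    Set metas = root.FindByInfo("META", 1, 10000)
    For i = 1 To metas.Count
        value = LCase(metas(i)("CONTENT"))
        pos = InStr(value, "charset=")
        If pos > 0 Then
            DetectCharset = Trim(Mid(value, pos + Len("charset=")))
            Exit Function
        End If
    Next
End Function

' Pass 2 (outline): create a second parser, apply the HTML preset, then e.g.
' If DetectCharset(scanner, content) = "windows-1251" Then parser.CodePage = 1251
' and parse the whole document with parser.Parse(content).
```

Using a separate parser object for the second pass avoids having to undo the META-only configuration.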
Although some configuration is needed, almost any correct HTML or XML can be
parsed and re-generated with some changes. Most often it is possible to
parse the documents partially, by specifying only the tags or embeds that
are actually of interest and treating the rest as plain text. Depending on the
usage you can optimize the performance by configuring the parser to work only
with the elements you actually need. For applications that parse
extensively the performance improvement can be drastic.
The object is named "Light" because it uses VarDictionary for the
document tree, thus avoiding the need to supply specialized objects for the
document tree. This is both a plus and a minus. Positives: fewer objects to learn,
no matter if the document is HTML, XML or even something else with some
HTML/XML-like inclusions. Negatives: the object model is nothing like the XML DOM
or DHTML - only the structure of the tree is the same, but you work with universal
objects which do not have members named after the HTML or XML standards.
Auto-closing
The parser auto-closes tags. If a closing tag is found that
is not the closing tag corresponding to the last open tag, the operation does
not fail. Instead the last open tag is closed automatically, then the
closing tag is compared to the open tag that contains it, and so
on until a match is found. Thus:
<TABLE>
<TR>
<TD>
<P> something
</TD>
<TD>
<P> something
</TD>
</TABLE>
will appear as intended in the document tree:
TABLE
  TR
    TD
      P
    TD
      P
This matches how most browsers treat incorrect HTML, so the parser's handling of
such documents reflects typical browser behavior.
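The auto-closing rule can be illustrated with a small stack of open tags. This is only a sketch of the idea in plain VBScript for Windows Script Host, not the component's actual implementation. What the parser does with a closing tag that matches nothing is not specified here; in this sketch such a tag simply closes everything.

```vbscript
' Illustrative stack of currently open tags
Dim openTags(63), depth
depth = 0

Sub OpenTag(name)
    WScript.Echo String(depth * 2, " ") & name   ' print with nesting indent
    openTags(depth) = name
    depth = depth + 1
End Sub

Sub CloseTag(name)
    ' Close open tags from the innermost outwards until the match is closed
    Do While depth > 0
        depth = depth - 1
        If openTags(depth) = name Then Exit Do
    Loop
End Sub

' The table fragment from above: <TR> and both <P> are never closed explicitly
OpenTag "TABLE" : OpenTag "TR"
OpenTag "TD" : OpenTag "P" : CloseTag "TD"
OpenTag "TD" : OpenTag "P" : CloseTag "TD"
CloseTag "TABLE"
```

Run with cscript, the script prints the tag names indented by nesting depth, giving the TABLE, TR, TD, P nesting described above.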
HTML Encoding/Decoding
The parser does not decode or encode any texts going from or to the document tree.
Thus when you need to HTML-decode or HTML-encode certain texts you must use an HTML
Encoder object to perform the Encode/Decode. One such object is usually enough
for the entire application - you call it wherever you need it.
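In a classic ASP page the built-in Server.HTMLEncode method can play the encoder's role when putting raw data into a text node, as in this sketch. SetEncodedText is a hypothetical helper name, and decoding still needs a separate encoder object.

```vbscript
' Classic ASP only: Server.HTMLEncode escapes <, >, & and double quotes
Sub SetEncodedText(textNode, rawValue)
    ' Store the already encoded text; the parser will output it unchanged
    textNode.Root = Server.HTMLEncode(rawValue)
End Sub

' Usage inside the row-filling loop shown earlier:
' SetEncodedText text, data(r)(c)
```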