PowerShell: parsing HTML vs. XML – demoandtest.website

XML and HTML look similar, but they follow different rules, so parsing them calls for different approaches.

Take the following example, which is both an HTML page and well-formed XML:

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Document</title>
</head>
<body>
    <pre> 
    some preformated text

        line 3
    </pre>
    <pre id = "Id2Find"> 
    some preformated text 2              line 3
    </pre>
    <br/>
    <input type="button" value="ButtonName"/>
</body>
</html>

Because this is well-formed XML, it can be parsed with the [xml] type accelerator:

[xml]$inputContent = Get-Content "some_xml.html"
$inputContent.html.body.input
$inputContent.html.body.pre
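As a minimal sketch of the same idea, the snippet below casts a trimmed copy of the sample to [xml] and reads values from it. The here-string, the shortened pre contents and the variable names are assumptions made for illustration; the original reads the markup from some_xml.html instead.

```powershell
# Assumption: a trimmed copy of the sample, kept in a here-string
# so that the snippet does not depend on some_xml.html.
$sample = @'
<html lang="en">
<head>
    <title>Document</title>
</head>
<body>
    <pre>first block</pre>
    <pre id="Id2Find">second block</pre>
    <br/>
    <input type="button" value="ButtonName"/>
</body>
</html>
'@

# The cast fails with a terminating error if the text is not well-formed XML.
[xml]$doc = $sample

# Dot notation walks the element tree; GetAttribute reads an attribute.
$buttonValue = $doc.html.body.input.GetAttribute('value')

# XPath finds an element by its id attribute wherever it sits in the tree.
$preText = $doc.SelectSingleNode('//pre[@id="Id2Find"]').InnerText

# SelectNodes returns every match, here both <pre> elements.
$preCount = $doc.SelectNodes('//pre').Count
```

The XPath lookup is handy when only the id is known and the exact path through html and body may change.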

Browsers open it and display it correctly (if the file extension is .htm or .html). So far so good, but what if the <br/> tag is not self-closing? How many web developers self-close their <br> or <meta> tags? If the self-closing slash is removed, Windows PowerShell returns the following:

Cannot convert value "System.Object[]" to type "System.Xml.XmlDocument". Error: "The 'br' start tag on line 12 position 6 does not match the end tag of 'body'. Line 
14, position 3."
At line:1 char:1
+ [xml]$inputContent = Get-Content "some_xml.html"
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : MetadataError: (:) [], ArgumentTransformationMetadataException
    + FullyQualifiedErrorId : RuntimeException
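The failed cast is a terminating error, so a script can catch it and decide what to do next. The sketch below wraps the cast in a small helper; the function name Test-WellFormedXml is hypothetical and not part of PowerShell.

```powershell
# Hypothetical helper: returns $true when the text can be cast to [xml].
function Test-WellFormedXml {
    param(
        [Parameter(Mandatory)]
        [string]$Text
    )
    try {
        # The cast throws when the markup is not well-formed XML.
        $null = [xml]$Text
        return $true
    }
    catch {
        return $false
    }
}

# A self-closed <br/> is well-formed XML, an unclosed <br> is not.
$closedOk   = Test-WellFormedXml '<body><br/></body>'
$unclosedOk = Test-WellFormedXml '<body><br></body>'
```

A check like this makes it possible to try the XML route first and fall back to an HTML parser only when the cast fails.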

So is there a convenient way to parse HTML pages with PowerShell? Well… there is one for Windows users (StackOverflow link), but I have not found one for Linux PowerShell users. The following snippet parses a page and lets you “navigate” through the HTML structure.

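# $URL and $htmlClassName must be set beforehand
$request = Invoke-WebRequest -Uri $URL -UseBasicParsing
# Windows only: the HTMLFile COM object parses the markup
$HTML = New-Object -Com "HTMLFile"
[string]$htmlBody = $request.Content
$HTML.write([ref]$htmlBody)
# Select elements by class name
$filter = $HTML.getElementsByClassName($htmlClassName)
# or by tag name (index 1 is the second <pre>), or by id
$filter = $HTML.body.getElementsByTagName("pre")[1].innerHTML
$filter = $HTML.getElementById("Id2Find").outerHTML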

-Com is a shortened form of the -ComObject parameter.

The method above parses even malformed documents, although the results can be surprising if the input is extremely confusing. For example, removing the > from one of the closing tags does not cause a terminating error, so the method copes not only with an unclosed <br> tag but also with genuinely incomplete tags.

If the required information sits in a tag with a known id or at a known path, parsing is easy.

Windows VBS will not be discussed, because VBS will soon be deprecated (if it has not been already): it was extremely easy to learn, and going from zero to abusing the system was just a tiny step away.

Parsing under Linux can be done, but it is either a real pain or requires external tools (which are not always acceptable on production servers).
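To give an idea of the "pain" route without external tools, here is a rough sketch that rewrites a few unclosed void tags as self-closing ones so that the [xml] cast succeeds. The function name ConvertTo-SelfClosedVoidTag, the tag list and the regular expression are assumptions for illustration. This only handles the unclosed-tag case described earlier; it is not a general HTML parser and will still fail on other malformed markup, or when an attribute value contains a > character.

```powershell
# Hypothetical helper: rewrites selected void tags such as <br> to <br/>.
function ConvertTo-SelfClosedVoidTag {
    param(
        [Parameter(Mandatory)]
        [string]$Html
    )
    # Group 1 is the tag name, group 2 its attributes (if any).
    # An optional trailing slash is consumed so <br/> is left equivalent.
    $pattern = '<(br|hr|img|input|meta|link)\b([^>]*?)\s*/?>'
    # -replace is case-insensitive by default, so <BR> is matched as well.
    return $Html -replace $pattern, '<$1$2/>'
}

# Assumption: a small fragment with an unclosed <br> and <input>.
$fragment = '<body><p>text</p><br><input type="button" value="ButtonName"></body>'

$fixed = ConvertTo-SelfClosedVoidTag $fragment
[xml]$fixedDoc = $fixed

$fixedButton = $fixedDoc.body.input.GetAttribute('value')
$fixedBrCount = $fixedDoc.SelectNodes('//br').Count
```

Because it relies only on the -replace operator and the [xml] cast, the sketch does not depend on the Windows-only COM object.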

Anyway, the following StackOverflow links discuss the topic: Link1, Link2, etc.