Parsing sites and web pages with PowerShell: Invoke-WebRequest, getElementsByTagName, and the fight for performance

Sometimes you need to parse a site or a large web page. A Google search turns up many programs and even whole software suites for this task, but I want to show how you can do it quite simply with PowerShell.

PowerShell has the Invoke-WebRequest cmdlet, which parses an HTML page into tags and content. The cmdlet returns a page object with a ParsedHtml property, and you can call methods on that property to extract the data you need. (ParsedHtml is available in Windows PowerShell; PowerShell 6 and later do not provide it.)

Assume you need to get the links from a page, in this case the ones marked with itemprop="url". See how it works.

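# Download the page; ParsedHtml exposes the parsed document.
$url = "http://site.com/page.html";
$page = Invoke-WebRequest $url -Method Get -DisableKeepAlive;
# Take every <a> element and keep only those whose itemprop attribute is "url".
$elements_a = $page.ParsedHtml.getElementsByTagName('a') | ?{$_.getAttribute('itemprop') -eq "url"};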

Once the result of the last line is in $elements_a, you need to drill into the properties of these objects and pull out the information you need. The exact data will differ from case to case, because every site is different. There is one universal example, though: getting the page's title, which works on every site.
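As an illustration of "drilling into properties", here is a minimal sketch that turns the link elements into plain objects. Get-LinkInfo is a hypothetical helper name, and the sketch assumes each element exposes innerText and href, as anchor elements from ParsedHtml normally do.

```powershell
# Hypothetical helper: reduce anchor elements to the two properties
# most scripts need. Assumes each element has innerText and href.
function Get-LinkInfo {
    param([object[]]$Elements)
    foreach ($el in $Elements) {
        [pscustomobject]@{
            # Casting through a string makes Trim() safe when innerText is empty.
            Text = "$($el.innerText)".Trim()
            Href = $el.href
        }
    }
}

# Usage with the variable from the example above (Windows PowerShell only):
# Get-LinkInfo -Elements $elements_a | Format-Table -AutoSize
```

Copying the values into ordinary objects early means the rest of the script no longer has to touch the original page elements.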

$title = $($page.ParsedHtml.getElementsByTagName('title')).innertext;

For complex sites, let's build more complex filters:

$address = $( $page.ParsedHtml.getElementsByTagName('p') | ?{$_.className -eq 'address'} ).innertext;
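One thing to watch with the filter above: className holds the whole class attribute, so an element written as class="address main" would not match -eq 'address'. A minimal sketch of a stricter check follows; Test-HasClass is a hypothetical helper name, not something built into PowerShell.

```powershell
# Hypothetical helper: true when the element's class list contains $Class.
function Test-HasClass {
    param($Element, [string]$Class)
    # className is a space-separated list; split it and drop empty tokens.
    $tokens = "$($Element.className)" -split '\s+' | Where-Object { $_ }
    # -ccontains compares case-sensitively.
    return ($tokens -ccontains $Class)
}

# Usage with the $page object from the examples above:
# $address = ($page.ParsedHtml.getElementsByTagName('p') |
#     Where-Object { Test-HasClass $_ 'address' }).innerText
```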

 

Invoke-WebRequest and getElementsByTagName performance

Microsoft's implementation of the getElementsByTagName method is quite heavy and slow. If you pipe its results into another cmdlet, the script can become catastrophically slow. For example, the previous example can take up to 30 seconds if the page is really big.

I don't know the exact reason for this. I found this discussion: http://stackoverflow.com/questions/14202054/why-is-this-powershell-code-invoke-webrequest-getelementsbytagname-so-incred. From it I understood that getElementsByTagName returns a bunch of COM objects, and we pass these COM objects into another cmdlet. This is a very slow process that takes up to 86% of the CPU time.
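To see where the time goes on your own page, you can time the method call alone and then the call together with the filter. The sketch below is only a measuring aid built on Measure-Command; Measure-Seconds is a hypothetical helper name, and the numbers you get will depend on the page and the machine.

```powershell
# Hypothetical helper: run a script block once and return the elapsed seconds.
function Measure-Seconds {
    param([scriptblock]$Script)
    (Measure-Command -Expression $Script).TotalSeconds
}

# Usage with the $page object from the examples above:
# $tCall   = Measure-Seconds { $page.ParsedHtml.getElementsByTagName('p') }
# $tFilter = Measure-Seconds { $page.ParsedHtml.getElementsByTagName('p') |
#     Where-Object { $_.className -eq 'address' } }
# "call only: $tCall s; call + filter: $tFilter s"
```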

So far I have found only one solution to this problem: use 32-bit PowerShell in such cases. I don't know why, but it gives good performance.
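On 64-bit Windows, the 32-bit Windows PowerShell host normally lives under the SysWOW64 folder. Here is a minimal sketch for locating it and rerunning a parsing script there; Get-PowerShell32Path is a hypothetical helper name, and the script path in the usage comment is only a placeholder.

```powershell
# Hypothetical helper: return the path of the 32-bit Windows PowerShell host,
# or $null when it cannot be found (for example on 32-bit Windows).
function Get-PowerShell32Path {
    if (-not $env:SystemRoot) { return $null }
    $candidate = Join-Path $env:SystemRoot 'SysWOW64\WindowsPowerShell\v1.0\powershell.exe'
    if (Test-Path $candidate) { return $candidate }
    return $null
}

# Usage: rerun the parsing script in the 32-bit host when the current process is 64-bit.
# $ps32 = Get-PowerShell32Path
# if ([Environment]::Is64BitProcess -and $ps32) {
#     & $ps32 -NoProfile -File 'C:\scripts\parse-page.ps1'   # placeholder script path
# }
```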

Tags: script, powershell (en)
