Tuesday, December 07, 2004

 

HTML Parser

1.
HTMLParser
http://htmlparser.sourceforge.net/

The two fundamental use cases handled by the parser are extraction and transformation (the synthesis use case, where HTML pages are created from scratch, is better handled by other tools closer to the source of data). While prior versions concentrated on data extraction from web pages, version 1.4 of the HTMLParser has substantial improvements in the area of transforming web pages, with simplified tag creation and editing, and verbatim toHtml() method output.

In general, to use the HTMLParser you will need to be able to write code in the Java programming language. Although some example programs are provided that may be useful as they stand, it's more than likely you will need (or want) to create your own programs or modify the ones provided to match your intended application.

To use the library, you will need to add either htmllexer.jar or htmlparser.jar to your classpath when compiling and running. The htmllexer.jar provides low-level access to generic string, remark and tag nodes on the page in a linear, flat, sequential manner. The htmlparser.jar, which includes the classes found in htmllexer.jar, provides access to a page as a sequence of nested, differentiated tags containing string, remark and other tag nodes. So where the output from calls to the lexer nextNode() method might be:

<html>
<head>
<title>
"Welcome"
</title>
</head>
<body>
etc...

The output from the parser NodeIterator would nest the tags as children of the <html>, <head> and other nodes (here represented by indentation):
<html>
    <head>
        <title>
            "Welcome"
        </title>
    </head>
    <body>
        etc...

The parser attempts to balance opening tags with ending tags to present the structure of the page, while the lexer simply spits out nodes. If your application requires only modest structural knowledge of the page, and is primarily concerned with individual, isolated nodes, you should consider using the lightweight lexer. But if your application requires knowledge of the nested structure of the page, for example processing tables, you will probably want to use the full parser.
Extraction
Extraction encompasses all the information retrieval programs that are not meant to preserve the source page. This covers uses like:
text extraction, for use as input for text search engine databases for example
link extraction, for crawling through web pages or harvesting email addresses
screen scraping, for programmatic data input from web pages
resource extraction, collecting images or sound
a browser front end, the preliminary stage of page display
link checking, ensuring links are valid
site monitoring, checking for page differences beyond simplistic diffs
There are several facilities in the HTMLParser codebase to help with extraction, including filters, visitors and JavaBeans.
Transformation
Transformation includes all processing where the input and the output are HTML pages. Some examples are:
URL rewriting, modifying some or all links on a page
site capture, moving content from the web to local disk
censorship, removing offending words and phrases from pages
HTML cleanup, correcting erroneous pages
ad removal, excising URLs referencing advertising
conversion to XML, moving existing web pages to XML
During or after reading in a page, operations on the nodes can accomplish many transformation tasks "in place", which can then be output with the toHtml() method. Depending on the purpose of your application, you will probably want to look into node decorators, visitors, or custom tags in conjunction with the PrototypicalNodeFactory.
The HTML Parser is an open source library released under the GNU Lesser General Public License, which basically says you are free to use the library "as is" in other (even proprietary) products, as long as due credit is given to the authors and the source code for the HTMLParser is included or available with the other product. For modified or embedded use, please consult the LGPL.

2.
Use the MSHTML COM library to parse the HTML file for you

By soarlove
From http://www.vccode.com/file_show.php?id=784

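The MSHTML COM library that ships with IE actually provides a lightweight HTML parsing engine, but because the Microsoft documentation rarely mentions it, not many people know about it. Yesterday a friend asked how to parse HTML, so I wrote an example that roughly shows how to use this lightweight parsing engine.
The main functionality is provided by the IMarkupServices interface, which parses an HTML document and produces a set of iterator-like IMarkupPointer interfaces and a container-like IMarkupContainer interface.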
...
The main interfaces are obtained in this order:
...
CLSID_HTMLDocument -> IHTMLDocument2 -> IPersistStreamInit
-> IMarkupServices -> IMarkupContainer, IMarkupPointer
-> IHTMLDocument2 -> IHTMLElement
...
The main implementation, in pseudocode:
...
CoCreateInstance IHTMLDocument2 from CLSID_HTMLDocument
QueryInterface IPersistStreamInit from IHTMLDocument2
spPersistStreamInit->InitNew()
QueryInterface IMarkupServices from IPersistStreamInit

IMarkupPointer spMkStart, spMkFinish
IMarkupContainer spMarkupContainer
IMarkupServices->CreateMarkupPointer(spMkStart)
IMarkupServices->CreateMarkupPointer(spMkFinish)

IMarkupServices->ParseString(strHTML, 0,
&spMarkupContainer, spMkStart, spMkFinish)
...
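The parsed markup can then be accessed in two ways.
The first is to traverse it with IMarkupPointer itself (a stream-based model).
The second is to access it through the DOM tree after parsing (a tree-based model). The former is more powerful, because some HTML cannot be represented as a tree (for example "Where do you want to go today?" marked up with tags that overlap each other), while the latter is simpler to use. For brevity, the example below does a simple traversal of the DOM tree (which means it can recurse, heh).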
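btw: because I used a template library I wrote myself (the ftl stuff), some of the code may be a bit unclear.
For example, CComContext g_ComContext; is a simple class that initializes the COM environment: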
struct CComContext {
CComContext() { ::CoInitialize(NULL); }
~CComContext() { ::CoUninitialize(); }
};
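The rest of the library code is long, so I have left it out and only outline what it does; since it is not part of the key code, you can simply delete anything that fails to compile. ComCheck is used a lot: it checks the return value and throws a custom exception (with automatic error reporting) on failure.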
inline HRESULT ComCheck(HRESULT hr)
{ if(FAILED(hr)) throw CComError(hr); return hr; }
Finally, log is a custom message tracer used to write the log file.

Development environment: Win2K Server SP2 + VC6 SP5 + IE6
The complete code follows:

// MarkupSvc.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
// Header names reconstructed from what the code below uses.
#include <iostream>
#include <string>
using namespace std;
#include <fstream>
#include <sstream>
#include <atlbase.h>
#include <atlconv.h>
#include <mshtml.h>
#include "ftl.h"
using namespace ftl;
CComContext g_ComContext;
CExceptionHandler g_ExceptionHandler;
CMsgTracer log;
// Recursively prints an element and its children as an indented tag tree.
void printElement(const CComPtr<IHTMLElement>& spElement, int nLevel)
{
    CComBSTR bstrTagName, bstrClassName, bstrID;
    ComCheck(spElement->get_tagName(&bstrTagName));
    ComCheck(spElement->get_className(&bstrClassName));
    ComCheck(spElement->get_id(&bstrID));
    USES_CONVERSION;
    string strIndent(nLevel * 2, ' '),
           strTagName(W2A(bstrTagName)), strClassName, strID;
    if(bstrClassName)
    {
        strClassName = " class=\"";
        strClassName += W2A(bstrClassName);
        strClassName += "\"";
    }
    else
        strClassName = "";

    if(bstrID)
    {
        strID = " id=\"";
        strID += W2A(bstrID);
        strID += "\"";
    }
    else
        strID = "";
    CComPtr<IDispatch> spDispatch;
    CComPtr<IHTMLElementCollection> spHTMLElementCollection;
    ComCheck(spElement->get_children(&spDispatch));
    ComCheck(spDispatch.QueryInterface(&spHTMLElementCollection));
    long lLength = 0;
    ComCheck(spHTMLElementCollection->get_length(&lLength));
    cout << strIndent << '<' << strTagName << strClassName << strID;
    if(!lLength)
    {
        // Leaf element: print its text, or close it immediately if it has none.
        CComBSTR bstrInnerText;
        ComCheck(spElement->get_innerText(&bstrInnerText));
        if(bstrInnerText)
        {
            cout << '>' << endl
                 << strIndent << "  " << W2A(bstrInnerText) << endl
                 << strIndent << "</" << strTagName << '>' << endl;
        }
        else
        {
            cout << "/>" << endl;
        }
    }
    else
    {
        cout << '>' << endl;
        CComVariant varName, varIndex;
        CComPtr<IHTMLElement> spChildElement;
        for(int i=0; i<lLength; i++)
        {
            // item() selects by zero-based index when its first argument is an integer.
            varName = i;
            spDispatch.Release();
            ComCheck(spHTMLElementCollection->item(varName, varIndex,
                &spDispatch));

            spChildElement.Release();
            ComCheck(spDispatch.QueryInterface(&spChildElement));
            if(spChildElement)
                printElement(spChildElement, nLevel+1);
        }
        cout << strIndent << "</" << strTagName << '>' << endl;
    }
}
// Prints the element tree of the parsed document, starting at <body>.
void print(const CComPtr<IMarkupContainer>& spMarkupContainer)
{
    log << "-+ Enter print @ " << tmd::timestamp << " +-" << endl << endm;
    CComPtr<IHTMLDocument2> spHTMLDocument2;
    ComCheck(spMarkupContainer.QueryInterface(&spHTMLDocument2));
    log << "Query interface IHTMLDocument2 from IMarkupContainer OK!" << endl
        << endm;
    CComPtr<IHTMLElement> spBody;
    ComCheck(spHTMLDocument2->get_body(&spBody));
    printElement(spBody, 0);
    log << "-+ Leave print @ " << tmd::timestamp << " +-" << endl << endm;
}
// Loads the file, parses it with IMarkupServices and prints the result.
void parse(const char *szFileName)
{
    log << "-+ Enter parse @ " << tmd::timestamp << " +-" << endl << endm;
    CComPtr<IHTMLDocument2> spHTMLDocument2;
    ComCheck(spHTMLDocument2.CoCreateInstance(CLSID_HTMLDocument));
    log << "Create interface IHTMLDocument2 from CLSID_HTMLDocument OK!" << endl
        << endm;
    CComPtr<IPersistStreamInit> spPersistStreamInit;
    ComCheck(spHTMLDocument2.QueryInterface(&spPersistStreamInit));
    log << "Query interface IPersistStreamInit from IHTMLDocument2 OK!" << endl
        << endm;
    ComCheck(spPersistStreamInit->InitNew());
    log << "Initialize interface IPersistStreamInit OK!" << endl << endm;
    CComPtr<IMarkupServices> spMarkupServices;
    ComCheck(spPersistStreamInit->QueryInterface(&spMarkupServices));
    log << "Query interface IMarkupServices from IPersistStreamInit OK!" << endl
        << endm;
    CComPtr<IMarkupPointer> spMkStart, spMkFinish;
    ComCheck(spMarkupServices->CreateMarkupPointer(&spMkStart));
    ComCheck(spMarkupServices->CreateMarkupPointer(&spMkFinish));
    log << "Create interface IMarkupPointer with IMarkupServices OK!" << endl
        << endm;
    // Read the whole file into memory before handing it to the parser.
    ifstream is(szFileName);
    stringstream ss;
    ss << is.rdbuf();
    log << "Read the HTML from " << szFileName << " [size:" << ss.tellp()
        << "] file OK!" << endl << endm;
    CComPtr<IMarkupContainer> spMarkupContainer;
    ComCheck(spMarkupServices->ParseString(CComBSTR(ss.str().c_str()), 0,
        &spMarkupContainer, spMkStart, spMkFinish));
    log << "ParseString HTML from " << szFileName << " with IMarkupServices OK!"
        << endl << endm;
    print(spMarkupContainer);
    log << "-+ Leave parse @ " << tmd::timestamp << " +-" << endl << endm;
}
int main(int argc, char* argv[])
{
    log << "-= begin @ " << tmd::timestamp << " OK =-" << endl << endm;
    if(argc == 1)
    {
        cout << ExtractFileName(string(argv[0])) << " [html filename]" << endl;
        return 1;
    }
    parse(argv[1]);
    log << "-= end @ " << tmd::timestamp << " OK =-" << endl << endm;
    return 0;
}

See also the following articles:

Lightweight HTML Parsing Using MSHTML
By Asher Kobin
From www.codeguru.com/Cpp/I-N/ieprogram/article.php/c4385/

Loading and parsing HTML using MSHTML. 3rd way.
By Philip Patrick
http://www.codeproject.com/internet/parse_html.asp

Loading HTML content from a Stream
http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/webbrowser/tutorials/webocstream.asp

MSHTML Reference
http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp

IHTMLDocument2 Interface
http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/ifaces/document2/document2.asp


