#1 2011-09-13 10:42:19

fmarin156
Member
51 posts

[resolved] parse html to extract some tags (information)

Hi,

I tried to parse HTML with an XML parser (JDOM), but it is not easy, mainly because an XML parser expects well-formed input and HTML often is not.

For example, a tag may be closed by a matching end tag (<script>...</script>) or written in the self-closing form (<script ..... />), and the XML parser stopped with an error in the latter case!
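
To illustrate the well-formedness problem, here is a minimal sketch that uses only the JDK's built-in XML parser (not JDOM itself, which is an assumption made to keep the example dependency-free). The class name WellFormedCheck, the helper isWellFormedXml and the sample markup are hypothetical; they only show that a strict XML parser rejects markup that browsers routinely accept.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class WellFormedCheck {

    // Returns true if a strict XML parser accepts the markup, false if it rejects it.
    public static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser found markup that is not well-formed XML.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every tag is closed and nested correctly.
        System.out.println(isWellFormedXml("<html><body><p>Hello</p></body></html>"));
        // <br> is never closed: common in HTML, an error in XML.
        System.out.println(isWellFormedXml("<html><body><p>Hello<br></body></html>"));
        // Unquoted attribute value: tolerated by browsers, an error in XML.
        System.out.println(isWellFormedXml("<a href=page.html>link</a>"));
    }
}
```

A lenient HTML parser accepts all three inputs, which is why a dedicated HTML parser tends to be the more practical choice for real pages.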

So I looked for a "good" HTML parser and chose "jericho"; it is relatively well documented and rather easy to use.

Another thing: parsing an XML stream with JDOM (and XERCES) is very slow (3 minutes for one page), while the same parsing with jericho takes only a few seconds.

fred

#2 2011-09-14 04:11:16

shong
Talend Team


Re: [resolved] parse html to extract some tags (information)

Hi fred

Thanks for sharing your solution!

Best regards
Shong

