#1 2011-09-13 10:42:19

54 posts

fmarin156 said:

[resolved] parse html to extract some tags (information)


I tried to parse HTML with an XML parser (JDOM), but it is not easy, mainly because an XML parser expects well-formed input and HTML is often not well-formed.

For example, a tag may be closed by a proper closing tag (<script>...</script>) or written in the self-closing form <tag ..... /> (<script ..... />), and the XML parser stopped with an error in the latter case!
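The well-formedness problem can be reproduced with the XML parser bundled in the JDK, without JDOM. The following is only a minimal sketch under some assumptions: the class name `WellFormedCheck`, the helper `isWellFormedXml`, and the sample snippets are made up for illustration, and the failing case is a slightly different one (an unclosed `<br>`, which is very common in real HTML pages).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of JDOM or Jericho.
class WellFormedCheck {

    // Returns true if the JDK's XML parser accepts the markup as well-formed XML.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input: it is not well-formed XML.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample snippets: the first closes every tag, the second does not.
        String xmlStyle = "<p>line one<br/>line two</p>";
        String htmlStyle = "<p>line one<br>line two</p>";
        System.out.println("xml-style:  " + isWellFormedXml(xmlStyle));
        System.out.println("html-style: " + isWellFormedXml(htmlStyle));
    }
}
```

A browser renders both snippets the same way, but a strict XML parser gives up on the second one at the first mismatched tag, which is why a tolerant HTML parser is the more practical choice for real pages.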

So I looked for a "good" HTML parser and chose Jericho; it is relatively well documented and rather easy to use.

Another point: parsing an XML stream with JDOM (and Xerces) was very slow (3 minutes for one page), while the same parse with Jericho took only a few seconds.



#2 2011-09-14 04:11:16

Talend Team

shong said:

Re: [resolved] parse html to extract some tags (information)

Hi Fred,

Thanks for sharing your solution!

Best regards

