I tried to parse HTML with an XML parser (JDOM), but it is not easy, mainly because XML must be well formed and HTML often is not.
For example, a tag may be closed by a proper end tag (<script>...</script>) or written as <tag ..... /> (<script ..... />), and an XML parser stops with an error in the latter case!
So I looked for a "good" HTML parser and chose "jericho"; it is reasonably well documented and rather easy to use.
Another thing: parsing an XML stream with JDOM (and XERCES) is very slow (3 minutes for one page), while the same parsing with jericho takes only a few seconds.
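To illustrate the well-formedness problem described above, here is a minimal sketch that uses the XML parser bundled with the JDK (not JDOM itself, and not the code from this job). The class name `WellFormedCheck`, the method `isWellFormedXml` and the two sample strings are made up for the example; the failing sample uses a `<br>` that is never closed, which is common in real HTML pages.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, for illustration only.
class WellFormedCheck {

    // Returns true if the markup parses as XML, false if the parser rejects it.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The markup is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every element is closed: accepted by the XML parser.
        String closed = "<html><body><p>Hello</p><br/></body></html>";
        // <br> and <p> are never closed: fine for a browser, fatal for an XML parser.
        String unclosed = "<html><body><p>Hello<br></body></html>";

        System.out.println(isWellFormedXml(closed));   // true
        System.out.println(isWellFormedXml(unclosed)); // false
    }
}
```

A tolerant HTML parser such as jericho is meant to accept input like the second sample instead of aborting, which is why it fits real web pages better.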
Thanks for sharing your solution!