• Index
  •  » Talend Open Studio for Data Integration » Usage, Operation
  •  » [resolved] parse html to extract some tags (information)

#1 2011-09-13 11:42:19

fmarin156
Member
Registered: 2010-11-23
Posts: 51

[resolved] parse html to extract some tags (information)

Hi,

I tried to parse HTML with an XML parser (JDOM), but it is not easy, mainly because XML must be well formed and HTML often is not.

For example, an element may be closed by a proper end tag (<script>...</script>) or written in the self-closing form <tag ..... /> (<script ..... />), and in that last case the XML parser stopped with an error!
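
To make the well-formedness problem concrete, here is a minimal sketch. It uses the DOM parser bundled with the JDK instead of JDOM, purely so it runs without extra jars; that swap is my assumption, not what was used in the job above. The class name WellFormedCheck, the method parsesAsXml and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    /** Returns true if the markup parses as XML, false if the parser rejects it. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser found markup that is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well formed: every start tag has a matching end tag.
        System.out.println(parsesAsXml("<html><body><p>Hello</p></body></html>"));
        // Common in real HTML, but not well formed: <br> is never closed.
        System.out.println(parsesAsXml("<html><body><p>Hello<br></p></body></html>"));
        // Also common in HTML: an attribute value without quotes.
        System.out.println(parsesAsXml("<html><body><input type=text></input></body></html>"));
    }
}
```

Pages taken from the web very often contain constructs like the last two, which is why a strict XML parser gives up on them while a tolerant HTML parser keeps going.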

So I looked for a "good" HTML parser and chose "jericho"; it is relatively well documented and rather easy to use.

Another thing: parsing an XML flow with JDOM (and Xerces) is very slow (3 minutes for one page), while the same parsing with jericho takes only a few seconds.

fred


#2 2011-09-14 05:11:16

shong
Talend team
Registered: 2007-08-29
Posts: 11169

Re: [resolved] parse html to extract some tags (information)

Hi fred

Thanks for sharing your solution!

Best regards
Shong


Email:shong@talend.com
