Talend Exchange is the place where the Talend community can share items related to Talend open source products, such as Data Integration, Data Quality and Master Data Management. Contribution is open to any user; no specific validation is needed. As soon as you have a forum account, you automatically get a Talend Exchange account.


tTikaExtractor


  • Author: Fxp
  • Categories: Component
  • First revision date: 2012-01-25
  • Latest revision date: 2012-01-25
  • Compatible with: Data Integration releases 5.0.0, 5.2.3, 5.3.0, 5.4.0
  • Downloads: 248

About: tTikaExtractor uses the Apache Tika parser to extract information from many different formats (HTML, PDF, DOC, ODT, image, audio, video, ...). See http://tika.apache.org/1.0/formats.html for more information about the available parsers.

Revision list

Revision 0.1 248 Downloads, Released on 2012-01-25

Compatible with: 5.4.0, 5.3.0, 5.2.3, 5.0.0

This first release parses any document supported by the Tika parser and provides properties for further processing:
* Tika Metadata object (METADATA_OBJ property)
* Tika metadata as text (METADATA property)
* Resource content as text (CONTENT property)
* Resource content as XHTML (CONTENT_XHTML property), which can be used in tExtractXMLField for further extraction

If you have trouble parsing some formats, download the complete tika-app jar file from http://tika.apache.org/download.html and use it to replace the one included in this package. The bundled jar was modified so that the component could be uploaded to Exchange, which probably has a size limit of around 18 MB.

Reviews (1)

 Very good... but not maintained? By FabriceS on May 27, 2014
Very useful indeed.
But the Apache Tika project is currently at version 1.5, while the current tTikaExtractor uses 1.0.
I upgraded it manually...


Version Author Released on Rating Downloads
Regex

Training export

0.1 cathyc 2014-06-10
9

Training export

Regex

Minutes Seconds

2.1 mhallam 2014-05-26
144

validates minutes:seconds format

Validates 3:56, 59:59,...
Does not validate 60:59 or 59:60.
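
The rule described above can be sketched as a short check. This is only an illustration, not the pattern published on Exchange: the regular expression and the function name `is_minutes_seconds` are assumptions chosen to reproduce the listed examples.

```python
import re

# Hypothetical pattern (not the one shipped with this item):
# minutes 0-59 written with one or two digits, a colon, seconds 00-59.
MINUTES_SECONDS = re.compile(r"[0-5]?[0-9]:[0-5][0-9]")


def is_minutes_seconds(value: str) -> bool:
    """Return True when the whole string is a minutes:seconds value."""
    return MINUTES_SECONDS.fullmatch(value) is not None


for sample in ["3:56", "59:59", "60:59", "59:60"]:
    print(sample, is_minutes_seconds(sample))
```

Using `fullmatch` rather than `search` keeps longer strings that merely contain a valid value from being accepted.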

SQL

hotmail email

2.0 scorreia 2014-05-26
312

filters email from hotmail.com

Regex

Dutch Postal Code

2.0 fcweeber 2014-05-26
112

Postal Code format verification (Netherlands)

Matches
9999AA|9999 AA
Non Matches
9999aa|9999Aa|9999 aa|9999 aA
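
A minimal sketch of a check that agrees with the matches and non-matches listed above. The pattern and the name `is_dutch_postal_code` are assumptions for illustration; the expression actually stored in this item may differ.

```python
import re

# Hypothetical pattern: four digits, an optional single space,
# then exactly two uppercase letters.
DUTCH_POSTAL_CODE = re.compile(r"[0-9]{4} ?[A-Z]{2}")


def is_dutch_postal_code(value: str) -> bool:
    """Return True when the whole string looks like 9999AA or 9999 AA."""
    return DUTCH_POSTAL_CODE.fullmatch(value) is not None


for sample in ["9999AA", "9999 AA", "9999aa", "9999 aA"]:
    print(sample, is_dutch_postal_code(sample))
```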

Regex

Dutch Phone Number

2.0 fcweeber 2014-05-26
113

Phone Number format verification (Netherlands)

Local: the Dutch phone number format is an area code (3 or 4 digits) followed by a subscriber number (7 or 6 digits), for a total length of 10 digits.
International: country code 0031, then the area code without its leading zero.
Matches
0031121234567|+31123123456|0121234567|012-1234567|0123-123456
Non Matches
012 1234567|1234567
match 02.31.23.45.67.22 or 004923123467223
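
The local and international formats described above can be sketched as follows. The pattern is an assumption built only from the listed "Matches" and "Non Matches" examples, and `is_dutch_phone_number` is a hypothetical name; it is not the expression distributed with this item.

```python
import re

# Hypothetical pattern covering the listed examples only:
#  - 0031 or +31 followed by nine digits (area code without its leading zero)
#  - ten local digits starting with 0
#  - a 3-digit area code, a hyphen, then 7 digits
#  - a 4-digit area code, a hyphen, then 6 digits
DUTCH_PHONE = re.compile(
    r"(?:0031|\+31)[1-9][0-9]{8}"
    r"|0[1-9][0-9]{8}"
    r"|0[1-9][0-9]-[0-9]{7}"
    r"|0[1-9][0-9]{2}-[0-9]{6}"
)


def is_dutch_phone_number(value: str) -> bool:
    """Return True when the whole string fits one of the formats above."""
    return DUTCH_PHONE.fullmatch(value) is not None


for sample in ["0031121234567", "012-1234567", "012 1234567", "1234567"]:
    print(sample, is_dutch_phone_number(sample))
```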

Regex

Names with unicode characters

2.0 scorreia 2014-05-26
110

Match people names with unicode characters.
Matches Jean-Marc, Jørn, Mc'Neelan, Pz. López
Does not match I.B.M.
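
One way to reproduce the behaviour described above is sketched here. The pattern is an assumption derived from the listed examples, and `is_person_name` is a hypothetical name; the expression stored in this item may be different.

```python
import re

# In Python 3, str patterns are Unicode-aware, so [^\W\d_] means
# "any letter" (a word character that is neither a digit nor an underscore).
LETTERS = r"[^\W\d_]+"
# A name part is letters, optionally joined by apostrophes or hyphens.
PART = LETTERS + r"(?:['-]" + LETTERS + r")*"
# Parts are separated by a space, optionally preceded by a period ("Pz. ").
PERSON_NAME = re.compile(PART + r"(?:\.? " + PART + r")*")


def is_person_name(value: str) -> bool:
    """Return True when the whole string matches the hypothetical name pattern."""
    return PERSON_NAME.fullmatch(value) is not None


for sample in ["Jean-Marc", "J\u00f8rn", "Mc'Neelan", "Pz. L\u00f3pez", "I.B.M."]:
    print(sample, is_person_name(sample))
```

A period is accepted only when a space follows it, which is why an abbreviation such as I.B.M. is rejected in this sketch.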

Regex

FR Phone Number (parenthesis allowed)

2.0 scorreia 2014-05-26
324

Matches French phone numbers in several formats:
matches:
0033 1 47 25 00 00
+33 1 47 25 00 00
(33) 1 47 25 00 00
0033147250000
01-47-25-00-00
01 47 25 00 00
Does not match
0147 250 000
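
The accepted formats listed above can be sketched with one expression. The pattern is an assumption that reproduces only these examples, and `is_french_phone_number` is a hypothetical name; it is not the expression published with this item.

```python
import re

# Hypothetical pattern:
#  - prefix: 0033, +33 or (33) with an optional separator, or a single 0
#  - one leading digit from 1 to 9
#  - four groups of two digits, each optionally preceded by a space or hyphen
FRENCH_PHONE = re.compile(
    r"(?:(?:0033|\+33|\(33\))[ -]?|0)"
    r"[1-9]"
    r"(?:[ -]?[0-9]{2}){4}"
)


def is_french_phone_number(value: str) -> bool:
    """Return True when the whole string fits the hypothetical pattern."""
    return FRENCH_PHONE.fullmatch(value) is not None


for sample in ["+33 1 47 25 00 00", "01-47-25-00-00", "0147 250 000"]:
    print(sample, is_french_phone_number(sample))
```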

Regex

Email Address (with list of top-level domains)

2.0 mzhao 2014-05-26
172

Check the validity of email addresses.

Regex

Regular Text

2.0 scorreia 2014-05-26
90

match regular text

matches regular text such as "hello Jean-Baptiste".
does not match text with any special character such as "# num", "test;"

Regex

Text

2.0 scorreia 2014-05-26
69

match regular text with punctuation correctly placed.

Version Author Released on Rating Downloads
Export

Product Demo

3.0 ctoum 2012-05-31
595

Product & families, with Cafepress pictures.

Data-Model

Clinical Trials: Janus Model Basics

1.0 jaymce 2010-11-22
399

This is a model of the basics of the Janus Clinical Data Repository.
http://www.fda.gov/ForIndustry/DataStandards/StudyDataStandards/ucm155327.htm

Data-Model

D* Demo Model

1.0 ctoum 2010-08-13
737

Model used in the D* Demo.

Export

Talendshop Demo

1.0 ctoum 2010-08-04
1126

Talendshop Demo (Demo Project)

