Talend Exchange is the place where the Talend community can share items related to Talend open source products, such as Data Integration, Data Quality and Master Data Management. Contribution is open to any user; no specific validation is needed. As soon as you have a forum account, you automatically get a Talend Exchange account.


tTikaExtractor


  • Author: Fxp
  • Categories: Component
  • First revision date: 2012-01-25
  • Latest revision date: 2012-01-25
  • Compatible with: Data Integration releases 5.0.0, 5.2.3, 5.3.0, 5.4.0
  • Downloads: 268

About: tTikaExtractor uses the Apache Tika parser to extract information from many different formats (HTML, PDF, DOC, ODT, image, audio, video, ...). See http://tika.apache.org/1.0/formats.html for more information about the available parsers.

Revision list


Revision 0.1 268 Downloads, Released on 2012-01-25

Compatible with: 5.4.0, 5.3.0, 5.2.3, 5.0.0

This first release parses any document supported by the Tika parser and provides properties for further processing:
* Tika Metadata object (METADATA_OBJ property)
* Tika metadata as text (METADATA property)
* Resource content as text (CONTENT property)
* Resource content as XHTML (CONTENT_XHTML property), which can be used in tExtractXMLField for further extraction

If you have trouble parsing some formats, download the complete tika-app jar file from http://tika.apache.org/download.html and use it to replace the one included in this pack. The bundled jar was trimmed so that the component could be uploaded to Exchange, which probably has a size limit of around 18 MB.

Reviews (1)

 Very good... but not maintained? By FabriceS on May 27, 2014
Very useful indeed.
But the Apache Tika project is currently at version 1.5, while the current tTikaExtractor uses 1.0.
I upgraded it manually...



Version Author Released on Rating Downloads
Regex

Temperature in Celsius

1.0 scorreia 2014-08-18
22

Check that a given number is a valid temperature in Celsius.
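The pattern itself is not shown on this page, so the following is only a minimal sketch of one way such a check could work, written with Python's `re` module. It assumes a plain signed decimal number and additionally rejects values below absolute zero (-273.15 °C); the function name `is_valid_celsius` and both of those rules are assumptions, not the behaviour of the published item.

```python
import re

# Assumed input form: optional sign, digits, optional decimal part.
_NUMBER = re.compile(r"[+-]?[0-9]+(?:\.[0-9]+)?")

# Lowest physically meaningful temperature in Celsius.
ABSOLUTE_ZERO_C = -273.15


def is_valid_celsius(text: str) -> bool:
    """Return True if text is a decimal number at or above absolute zero."""
    candidate = text.strip()
    if not _NUMBER.fullmatch(candidate):
        return False
    return float(candidate) >= ABSOLUTE_ZERO_C
```

A pure regular expression can only approximate the lower bound, which is why the numeric comparison is done after the match in this sketch.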

Regex

Companies House

4.0 mhallam 2014-07-29
86

This regular expression matches the Companies House 503 Reference number that is given when a customer places an online order. www.companieshouse.gov.uk

Regex

Bulgaria Vat Number

4.0 mhallam 2014-07-29
71

VAT number for Bulgaria. Formats are:
BG123456789
BG1234567890
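The two formats above amount to the prefix `BG` followed by nine or ten digits. A minimal sketch of that rule using Python's `re` module follows; the name `is_bulgarian_vat` is hypothetical, and the actual expression published in this item may differ.

```python
import re

# "BG" followed by exactly 9 or 10 ASCII digits, per the two listed formats.
BG_VAT = re.compile(r"BG[0-9]{9,10}")


def is_bulgarian_vat(text: str) -> bool:
    """Return True if text has the form BG123456789 or BG1234567890."""
    return BG_VAT.fullmatch(text) is not None
```

This checks the format only; it does not verify that the number was actually issued.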

Regex

Austria VAT Number

4.0 mhallam 2014-07-29
89

mysql

Regex

Complex Australian Phone Number

4.0 mhallam 2014-07-29
91

Australian phone number validator. Accepts all forms of Australian phone numbers in different formats (area code in brackets, no area code, spaces between 2-3 and 6-7th digits, +61 international dialing code). Checks that area codes are valid (when entered).

Regex

Bank Routing Transit Number (RTN)

4.0 mhallam 2014-07-29
84

Ensures a given string matches the basic pattern of a bank routing transit number (RTN), used to identify financial institutions on instruments such as checks. Ensures number is nine digits long and has first two digits that comply with American Bankers Association rules.
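As a rough illustration of the description above, the sketch below accepts nine-digit strings whose first two digits fall in the ranges 00-12, 21-32, 61-72 or 80. Those ranges reflect the commonly documented ABA prefix scheme and are an assumption here, since the published expression is not shown on this page; the name `looks_like_rtn` is hypothetical.

```python
import re

# Nine digits total; the first two must be in an assumed valid ABA range:
# 00-12, 21-32, 61-72, or 80.
RTN_PATTERN = re.compile(
    r"(?:0[0-9]|1[0-2]|2[1-9]|3[0-2]|6[1-9]|7[0-2]|80)[0-9]{7}"
)


def looks_like_rtn(text: str) -> bool:
    """Return True if text matches the basic RTN pattern (no checksum)."""
    return RTN_PATTERN.fullmatch(text) is not None
```

Like the item described, this only tests the basic pattern; it does not compute the routing number's check digit.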

Regex

ISBN Checker

4.0 mhallam 2014-07-29
108

Expression to check for a valid ISBN number
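The expression itself is not reproduced on this page. As a hedged sketch of what an ISBN check can involve, the Python code below uses a regular expression for the shape (ISBN-10 or ISBN-13, with optional hyphens or spaces) and then verifies the standard check digit. The checksum step goes beyond what a regular expression alone can do and is an addition for illustration; `is_valid_isbn` is a hypothetical name.

```python
import re


def is_valid_isbn(text: str) -> bool:
    """Return True if text is a well-formed ISBN-10 or ISBN-13."""
    # Drop hyphens and whitespace, and normalise a trailing 'x'.
    digits = re.sub(r"[\s-]", "", text).upper()

    # ISBN-10: nine digits plus a check character (digit or X = 10).
    if re.fullmatch(r"[0-9]{9}[0-9X]", digits):
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch))
            for i, ch in enumerate(digits)
        )
        return total % 11 == 0

    # ISBN-13: thirteen digits, weights alternate 1 and 3.
    if re.fullmatch(r"[0-9]{13}", digits):
        total = sum(
            int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(digits)
        )
        return total % 10 == 0

    return False
```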

Regex

International Passport

4.0 mhallam 2014-07-29
80

9 characters made up of a combination of numbers and/or letters. Where less than 9 characters it will be padded out to the right with chevrons (

Regex

UK Vehicle Registration Plate / Number Plate

4.0 mhallam 2014-07-29
103

Matches: AB12 RCY, CD07 TES, S33 GTT, Y999 FVB
Non-matches: ab12 rcy, CD07 TIS, S34 GTT, Z999 FVB

UK Vehicle Registration Plate / Number Plate format as specified by the DVLA. Accepts both "Prefix" and "New" styles. Allows only valid DVLA number combinations, as not all are supported. The registration number must be exactly as displayed on the car, so all letters must be uppercase and a space must separate the two sets of characters.

Regex

Swedish Personal Nr (Personnummer)

4.0 mhallam 2014-07-29
114

Simple regex for the Swedish personal number. It's in the form: YYMMDD-xxxx where xxxx is an arbitrary number from 0000-9999.

791231-1234
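The form described above (six digits, a hyphen, four digits) can be sketched in Python as follows. This is an illustration of the stated format only, not the published expression; the name `is_personnummer_format` is hypothetical, and the sketch does not validate the date or any check digit.

```python
import re

# YYMMDD-xxxx: six digits, a hyphen, then four digits (0000-9999).
PERSONNUMMER = re.compile(r"[0-9]{6}-[0-9]{4}")


def is_personnummer_format(text: str) -> bool:
    """Return True if text has the shape YYMMDD-xxxx."""
    return PERSONNUMMER.fullmatch(text) is not None
```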

Version Author Released on Rating Downloads
Export

Product Demo

3.0 ctoum 2012-05-31
647

Product & families, with Cafepress pictures.

Data-Model

Clinical Trials: Janus Model Basics

1.0 jaymce 2010-11-22
426

This is a model of the basics of the Janus Clinical Data Repository.
http://www.fda.gov/ForIndustry/DataStandards/StudyDataStandards/ucm155327.htm

Data-Model

D* Demo Model

1.0 ctoum 2010-08-13
769

Model used in the D* Demo.

Export

Talendshop Demo

1.0 ctoum 2010-08-04
1147

Talendshop Demo (Demo Project)

