Talend Exchange is where the Talend community shares items related to Talend open source products, such as Data Integration, Data Quality and Master Data Management. Contribution is open to any user; no specific validation is needed. As soon as you have a forum account, you automatically get a Talend Exchange account.


tTikaExtractor


  • Author: Fxp
  • Categories: Component
  • First revision date: 2012-01-25
  • Latest revision date: 2012-01-25
  • Compatible with: Data Integration releases 5.0.0, 5.2.3, 5.3.0, 5.4.0
  • Downloads: 219

About: tTikaExtractor uses the Apache Tika parser to extract information from many different formats (HTML, PDF, DOC, ODT, image, audio, video, ...). See http://tika.apache.org/1.0/formats.html for more information about the available parsers.

Revision list

Revision 0.1: 219 downloads, released on 2012-01-25

Compatible with: 5.4.0, 5.3.0, 5.2.3, 5.0.0

This first release parses any document supported by the Tika parser and provides properties for further processing:
* Tika Metadata object (METADATA_OBJ property)
* Tika metadata as text (METADATA property)
* Resource content as text (CONTENT property)
* Resource content as XHTML (CONTENT_XHTML property), which can be used in tExtractXMLField for further extraction

If you have trouble parsing some formats, download the complete tika-app jar file from http://tika.apache.org/download.html and use it to replace the one included in this package. The bundled jar was trimmed so that the component could be uploaded to Talend Exchange, which probably has a size limit of around 18 MB.

Version Author Released on Rating Downloads
ParserRule

Tweets

1.0 scorreia 2013-11-20
19

Gets information from tweets.
Extract the date/time, user, hashtags, referenced users and urls from Twitter messages.
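
As a rough illustration of the kind of extraction this parser rule describes, the sketch below pulls hashtags, referenced users and URLs out of a message with regular expressions. It is not the grammar of the published rule: the patterns and the function name `extract_tweet_parts` are assumptions made for this example, and date/time extraction is left out.

```python
import re

# Hypothetical patterns for illustration only; the published rule may differ.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")


def extract_tweet_parts(message):
    """Return the hashtags, referenced users and URLs found in a message."""
    return {
        "hashtags": HASHTAG.findall(message),
        "users": MENTION.findall(message),
        "urls": URL.findall(message),
    }


parts = extract_tweet_parts("@talend new release #etl #dq http://example.com/x")
print(parts)
```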

Regex

Only alphabetical characters not empty

1.0 dcortinovis 2013-06-19
63

Matches only alphabetical characters, with at least one required (empty values are forbidden).
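
The listing does not show the pattern itself. A minimal sketch of a check with the same intent, assuming "alphabetical" means ASCII letters only, could look like this (the name `is_alpha_not_empty` is hypothetical):

```python
import re

# Assumed pattern: one or more ASCII letters and nothing else.
ALPHA_NOT_EMPTY = re.compile(r"[a-zA-Z]+")


def is_alpha_not_empty(value):
    """True when the whole value is made of letters and is not empty."""
    return ALPHA_NOT_EMPTY.fullmatch(value) is not None


for sample in ["Talend", "", "abc1", "a b"]:
    print(repr(sample), is_alpha_not_empty(sample))
```

Using `+` rather than `*` is what rejects the empty string.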

Indicator

EMail validation via mail server

5.4/5.3 mzhao 2013-06-03
520

This Java UDI (user-defined indicator) checks email addresses by sending an SMTP request to the mail server. The code sample can be found at: http://www.rgagnon.com/javadetails/java-0452.html

Indicator

Frequency table of hours

2.0 scorreia 2013-04-25
358

This indicator helps analyze which hours of the day appear most frequently in date-time columns.

Indicator

Sample Standard Deviation

1.1 scorreia 2013-04-25
269

This indicator computes the sample standard deviation of any numerical column.
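
For reference, the sample standard deviation divides the sum of squared deviations by n - 1 rather than n. The sketch below shows that textbook computation. It is not the indicator's own implementation, and the function name `sample_std_dev` is made up for this example.

```python
import math


def sample_std_dev(values):
    """Sample standard deviation using the n - 1 denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    # Sample variance: squared deviations from the mean, divided by n - 1.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)


print(sample_std_dev([2, 4, 4, 4, 5, 5, 7, 9]))
```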

Indicator

Variance

1.1 scorreia 2013-04-25
250

This indicator computes the variance of numeric columns.

Indicator

Trimmed

1.0 scorreia 2013-04-25
60

Evaluates the number of values that are correctly trimmed.

Indicator

Week Frequency

2.0 scorreia 2013-04-25
270

Aggregates date fields into weeks.

Indicator

Duplicate Rows

2.0 scorreia 2013-04-25
775

This indicator counts the number of duplicate rows.
It's different from the system indicator called "duplicate count" because it counts the number of duplicate rows, not the number of duplicate values.
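
The difference between the two counts can be sketched as follows. The example assumes that a "duplicate" is a distinct item occurring more than once; the listing does not say whether the indicator counts distinct duplicated rows or every extra occurrence. The helper `duplicate_count` and the sample data are hypothetical.

```python
from collections import Counter


def duplicate_count(items):
    """Number of distinct items that occur more than once (assumed definition)."""
    return sum(1 for count in Counter(items).values() if count > 1)


rows = [
    ("Ann", "Paris"),
    ("Ann", "Lyon"),
    ("Bob", "Paris"),
    ("Bob", "Lyon"),
    ("Ann", "Paris"),
]

# Whole rows are compared: only ("Ann", "Paris") repeats.
duplicate_rows = duplicate_count(rows)
# A single column is compared: both "Ann" and "Bob" repeat.
duplicate_names = duplicate_count(row[0] for row in rows)

print(duplicate_rows, duplicate_names)
```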

Indicator

Length Range Frequency

1.1 scorreia 2013-04-25
123

Groups data according to the range their length falls in.
The ranges are the following:
data of length < 10
data of length < 20
data of length < 30
data of length >= 30
null data
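
A minimal sketch of this grouping, assuming each value is counted in exactly one range (so "< 20" covers lengths 10 to 19, and so on). The function names are hypothetical and this is not the indicator's own code.

```python
from collections import Counter


def length_range(value):
    """Return the label of the range that a value's length falls in."""
    if value is None:
        return "null"
    n = len(value)
    if n < 10:
        return "< 10"
    if n < 20:
        return "< 20"
    if n < 30:
        return "< 30"
    return ">= 30"


def length_range_frequency(values):
    """Count how many values fall in each length range."""
    return Counter(length_range(v) for v in values)


sample = ["abc", "a" * 15, "b" * 29, "c" * 30, None, ""]
print(length_range_frequency(sample))
```

Note that an empty string counts as length 0 here, while `None` goes to the separate null bucket.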

Version Author Released on Rating Downloads
Export

Product Demo

3.0 ctoum 2012-05-31
559

Product & families, with Cafepress pictures.

Data-Model

Clinical Trials: Janus Model Basics

1.0 jaymce 2010-11-22
377

This is a model of the basics of the Janus Clinical Data Repository.
http://www.fda.gov/ForIndustry/DataStandards/StudyDataStandards/ucm155327.htm

Data-Model

D* Demo Model

1.0 ctoum 2010-08-13
706

Model used in the D* Demo.

Export

Talendshop Demo

1.0 ctoum 2010-08-04
1109

Talendshop Demo (Demo Project)

