NLP - (Software | API)

Text Mining

List

  • Apache Nutch: open source web crawler (Nutch can crawl pages and post them to Apache Solr for indexing and search)
  • Apache Tika: detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF)
  • Lucene Core: text search engine library
  • Apache Solr: search platform from the Apache Lucene™ project
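As a rough illustration of what a text search engine library such as Lucene does at its core, here is a minimal sketch of an inverted index in plain Python. It is not Lucene's or Solr's API; the function names (`tokenize`, `build_index`, `search`) and the sample documents are assumptions made up for this example, and real engines add ranking, analyzers, and on-disk storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query token (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    "doc1": "Apache Tika extracts text from PDF files",
    "doc2": "Apache Solr indexes text for search",
    "doc3": "Nutch crawls the web",
}
index = build_index(docs)
print(sorted(search(index, "apache text")))  # ['doc1', 'doc2']
```

The index is built once and each query only touches the posting sets of its own tokens, which is the basic reason an inverted index scales better than scanning every document.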
| Library | Language | Open Source | Note |
|---|---|---|---|
| NLTK | Python | Yes | |
| Gensim | Python | Yes | |
| spaCy (spacy.io) | Python | Yes | |
| Elasticsearch (Index and Search) | Java | Apache 2 (based on Lucene) | Guide, Crate (query / SQL layer on top of Elasticsearch) |
| Solr (Index and Search) | Java | Apache 2 (based on Lucene) | Solr |
| Apache OpenNLP | Java | Yes | |
| Deeplearning4j | Java, Scala | Yes | |
| Weka | Java | GPL | See https://github.com/fracpete/nlp-weka-package |
| Stanford NLP | Java | GPL | Demo (Part of Speech, Named Entity Recognition, Coreference, Basic dependencies, Collapsed dependencies, Collapsed CC-processed dependencies). Github: http://stanfordnlp.github.io/CoreNLP/ Online Run: http://corenlp.run/ |
| LingPipe | Java | No | Topic Classification, Named Entity Recognition (NER), Sentiment Analysis, … |
| tm | R | Yes | |
| RWeka | R | Yes | rJava via JNI |
| openNLP | R | Yes | rJava via JNI |
| OCR | | | Tesseract |
| TweetNLP | Java | Yes | Tokenizer, part-of-speech tagger, hierarchical word clusters, and a dependency parser for tweets |
| Smile | Java | LGPL | Statistical Machine Intelligence and Learning Engine |
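
The TweetNLP entry mentions a tokenizer for tweets. To show why tweets need special handling (mentions, hashtags, URLs, emoticons), here is a small regex-based sketch in plain Python. It is not TweetNLP's tokenizer; the pattern and the `tokenize_tweet` name are assumptions for illustration, and a real tokenizer handles many more cases.

```python
import re

# Alternatives are tried left to right, so URLs and @/# tokens win over plain words.
TOKEN_RE = re.compile(
    r"""
    https?://\S+        # URLs
    | [@#]\w+           # @mentions and #hashtags
    | \w+(?:'\w+)?      # words, with an optional apostrophe part (e.g. don't)
    | [^\w\s]           # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize_tweet(text):
    """Split a tweet into tokens, keeping mentions, hashtags, and URLs whole."""
    return TOKEN_RE.findall(text)


print(tokenize_tweet("@bob loving #NLP, see https://example.com :)"))
# ['@bob', 'loving', '#NLP', ',', 'see', 'https://example.com', ':', ')']
```

Note that this sketch splits the emoticon `:)` into two symbols and would absorb punctuation directly attached to a URL; a dedicated tweet tokenizer treats such cases explicitly.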

Oracle






