Sentence Analyzer - enabling business and enterprise applications to handle sentences and text

What is this about? It is about using English sentences easily in business software applications. It adds a major, 2.0-level capability with no fuss and minimal effort.

User participation clearly shows that text is very important. Popular applications are text-driven (e.g. Facebook, Twitter, blogs, chats, forums: users love text!). It is perilous to ignore free text, yet language techniques find no place in database-style enterprise software.
Text scans can be used for information retrieval, dashboards and reports, alerts, and sentiment capture, and even for business process decisions once confidence in the results has grown.
Using this toolkit, you can machine-scan any English text: enterprise documents, web pages, user comments, workflow task comments, tweets, blogs, forum postings. Every analytics, BI, and mining component can exploit this powerful capability.
Please also review the high-resolution document extraction tool at text2data.net if information retrieval from semi-structured documents is your priority.
Going a bit further, you may want to compete with Google (see here), and question the basic validity of word-to-document indexing in text searches. In that case, this toolkit allows a different way of looking at words, sentences, paragraphs, and documents. That topic is a bit too involved to discuss here.
More on that later. View the raw analysis and XML creation below, then visit the pages on sentence comparisons, find/search/filter, crunchings, and the wildcard mode.
Please also view some real-life business use cases, such as handling the Notes section in annual reports, crunching a Presidential speech, executive profiles, project statuses, a USPTO events alerter, and product/customer reviews. For questions: kinshuk_in @ yahoo dot. com
Note: The RESULTS shown are the human-readable equivalent of Java/C# objects, which carry a lot of additional information (word meanings, group codes such as colors and flavors, etc.) intended for further analytical/statistical treatment. This demo, created with no NLP APIs, stresses that with text it is better to first maximize grammar-based processing and to use statistical/mathematical methods much later.
Sentences must be separated from each other by a terminator (. or ! or ?) followed by one space.
Skip the descriptive stuff and go directly to demo
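The separation rule above (a terminator followed by one space) can be sketched in a few lines of Java. This is only an illustration of the input convention, not the toolkit's own code; the class name SentenceSplitter and its split method are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the toolkit: splits text using the rule
// stated above (a terminator ., ! or ? followed by one space).
final class SentenceSplitter {

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        // The lookbehind keeps each terminator attached to the sentence it ends;
        // only the single space after it is consumed as the separator.
        for (String part : text.trim().split("(?<=[.!?]) ")) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The order was shipped. Was it late? No!";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

A rule this simple also splits after abbreviations such as "e.g. ", so input text may need a little tidying before it is submitted.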
Basic analysis demo: XML tree, phrasals, fragments. (Click more samples, then Analyze.)
Please enter a complete sentence (only one; additional sentences will be ignored).
(Please scroll down for the RESULTS)     
Summary results : Sentence parsed. 1 sentence(s) or sub-sentences found.
This is the XML fingerprint that any biz application can re-use.
Here is the phrasal deconstruction (minor nodes chopped). A review from the bottom upwards might be easier. This "proves" the correctness of the XML.

The API makes two levels of output available: (1) a low-level XML (or its equivalent object), and (2) a set of grammatical fragments of your choice. Both are machine-processable for a wide variety of actual business use cases. The XML is like an individual "fingerprint" of a sentence. The grammatical fragments represent possible classifications of the different kinds of fingerprints. You may want to derive different higher-level relationships for your specific usage, say, use modifiers differently for a sentiment analysis component. Most things in this toolkit, including the rules of Universal grammar and English grammar, which fragments to pick, comparison yardsticks, etc., are customizable to quite an extent, and changes are quite easy.

So, what can be done using the analyzed output? This is where this toolkit scores a major plus over similar software. Business-relevant methods (say, find/sort) are available in a flexible, easy-to-use way. The two outputs from the raw analysis step have been used to do sentence comparisons, find/search/filter, crunchings, and wildcard search. The business app developer will not need to wrestle with NLP methods or statistics, yet experts can use those later if needed, and the end-use tools can become user-configurable at the level of sentences.
More results from the basic analysis are shown below (please scroll down).
This synopsis (hopefully) captures the gist of the sentence
Or sometimes, a looser summary capture can be more useful
Much of the analysis was based on this original assumption, with minor deviations taken in stride.
Nouns, subjects and objects, with their modifiers, qualifiers
Verbs, eventualities, with their modifiers, qualifiers
Note: Also extracted, but not shown here, are 6 other relationships and a set of subsidiary non-XML sentence patterns. For a good sentence analysis there can be many higher-level rules, i.e. "more Latin and Greek education for a machine". This demo shows only some in-principle capabilities; its high-level grammar is still a to-do.
Features and pluses in the demo you may not find elsewhere on the planet:
  • Tackles most kinds of sentences. Few exceptions.
  • Handles nested, compound, complex sentences, dialogs etc.
  • Ability to select best-fit word variants (e.g. "He ducked the ball"). [This ability is suppressed in this demo.]
  • An XML output is the best kind of representation for direct use in today's biz apps.
  • The phrasal de-construction (see above) shows that the XML is not an arbitrary and meaningless grouping.
  • Much of the logic and many of the rules used can be changed and improved externally.
  • Dictionaries specific to business jargon can be inserted easily.
  • A large and extendable number of derived patterns, objects, and fragments.
  • The multitude of such patterns makes it easy to differentiate "content" from "structure".
  • As you view other demo pages, you may find that this differentiation, and control over the two, are crucial to the handling of sentences.
  • Two levels of grammar (universal and English) increase the possibility of extending it to other languages.
  • This toolkit and text2data.net represent a conceptually different way of handling text, documents, etc., about which only results can speak, for now.
While viewing all the other demo pages, please keep in mind:
  • The focus of this toolkit is on sentences, and sentences alone. It treats the sentence as the smallest unit of information and attempts nothing more or less.
  • The intended target sources are documents/texts that have a discernible and predictable structure.
  • Handling unstructured inputs is a secondary intention of this toolkit and awaits grammar expertise.
Technical info: the working jar is about 1 MB, and its dependencies are about 5 MB. Measured speed on my PC is about 1000 sentences analyzed per minute, with no code optimization effort. The current API has 2-3 interface methods, including compare(s1, s2, config) and summarize(s1, config). I think 4-5 more coarse-grained methods can be added if we plug in other NLP APIs such as NER (none are included at this point), or add statistical APIs (no statistical methods are used as of now). No sentiment mapping is implemented in this demo, but some options are indicated. Final implementations could use all of that; this demo is just indicative.
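To give a feel for how a business application might call the two methods named above, here is a minimal Java sketch. Only the method names and argument order come from this page. The interface name SentenceAnalyzer, the parameter and return types, the "maxWords" config key, and the NaiveAnalyzer class are all assumptions made for illustration. The placeholder implementation uses naive word overlap and truncation so that the example runs; it is not the grammar-based analysis the toolkit performs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical shape for the two methods named above. The types are assumptions;
// the page only gives the method names and the order of their arguments.
interface SentenceAnalyzer {
    double compare(String s1, String s2, Map<String, String> config);
    String summarize(String s1, Map<String, String> config);
}

// Placeholder implementation so the sketch runs. It uses word overlap and
// truncation, NOT the toolkit's grammar-based processing.
final class NaiveAnalyzer implements SentenceAnalyzer {

    private static Set<String> words(String s) {
        Set<String> set = new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
        set.remove(""); // split can yield an empty token for empty or punctuation-led input
        return set;
    }

    @Override
    public double compare(String s1, String s2, Map<String, String> config) {
        Set<String> shared = words(s1);
        Set<String> other = words(s2);
        Set<String> union = new HashSet<>(shared);
        union.addAll(other);
        shared.retainAll(other); // shared now holds the intersection
        return union.isEmpty() ? 0.0 : (double) shared.size() / union.size();
    }

    @Override
    public String summarize(String s1, Map<String, String> config) {
        // "maxWords" is an invented config key used only by this placeholder.
        int maxWords = Integer.parseInt(config.getOrDefault("maxWords", "5"));
        String[] tokens = s1.trim().split("\\s+");
        return String.join(" ", Arrays.copyOf(tokens, Math.min(maxWords, tokens.length)));
    }

    public static void main(String[] args) {
        SentenceAnalyzer analyzer = new NaiveAnalyzer();
        System.out.println(analyzer.compare("The project is late.", "The project is on time.", Map.of()));
        System.out.println(analyzer.summarize("The project is running two weeks late.", Map.of("maxWords", "3")));
    }
}
```

Keeping the interface this coarse means the calling application never touches parsing details, and the config argument is where comparison yardsticks or fragment choices could be passed in.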
More view/try pages here.
Generic use cases
Back to home/basic analyzer
Comparing sentences, several modes
Find/search/sort/filter
Crunching a big text
Wildcard usages in pure structure mode
Business use cases
Handling Notes section in annual reports
Crunching of a Presidential speech
Executive profiles
Project statuses
USPTO events alerter
Customer reviews

More reading for those interested...

1. Things that are not obvious from the demo
2. Business products and possibilities
3. So what !! Universal grammar has been in use for decades now ...
4. The inevitable comparisons, to what already exists out there.
5. What is a sentence, to future application builders ?
6. The genesis and design principles story
7. Arbitrary listing of business usages
8. Extensions, additions, customizations possible in the toolkit
9. Important : Combining with the document extraction tool at text2data.net, benefits
Raise the bar! RDBMS databases are easy, but they do not truly represent human content. Think about applications differently, with text.
Contact at : kinshuk_in @ yahoo dot. com