Seaglex Software has developed unique technologies for extracting,
structuring, and classifying data from the Internet and Intranets and for automating human interactions
with Web sites. We leverage our revolutionary technological platform to deliver packaged,
industry-specific vertical products and custom information extraction solutions. Our technologies are described below.
HTML Pattern Recognition
This technology allows users to convert arbitrary Internet/Intranet HTML sources
into streams of structured XML data. It employs sophisticated pattern recognition algorithms to
detect repetitive and recursive patterns with minimum help from the user. The parsing scripts
generated at the end of the recognition process can then be automatically run to extract data
from the HTML sources. These routines can extract data even if the HTML source appearance
varies every time new data is retrieved. No manual programming is necessary; the user interacts
with the application in much the same fashion as when browsing the Web, by pointing, clicking, and annotating.
Our tools generate XML-structured representations of content such as news headlines, as illustrated below.
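As a minimal sketch (not the exact schema our tools emit), the following Python snippet uses the standard xml.etree library and hypothetical element and attribute names to build such a headline stream:

```python
import xml.etree.ElementTree as ET

# Hypothetical element names and sample values; the real schema is produced
# automatically by the pattern recognition step.
headlines = ET.Element("headlines", source="example-news-site")
for title, url, date in [
    ("Rates Hold Steady After Fed Meeting", "http://example.com/story1", "2003-05-12"),
    ("Office Vacancies Decline Downtown", "http://example.com/story2", "2003-05-12"),
]:
    item = ET.SubElement(headlines, "headline", date=date)
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = url

print(ET.tostring(headlines, encoding="unicode"))
```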
Playback/Recording Technology
This technology records user interactions with HTML sources in such a fashion that
user actions can then be replayed in repetitive cycles and on a massive scale. Recordings of user
actions become parameterized procedures that can be called with different arguments through
programmatic Application Programming Interfaces (APIs). Again, the user interacts with the
application in a Web-browsing fashion and is not required to develop any custom code.
This technology
also allows XML-driven automation of form input and secure login procedures by presenting all available fields
as nodes of an XML schema. The following example illustrates this concept:
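In the minimal sketch below, a hypothetical login form is described in XML and replayed as a parameterized request; the field names, values, and URL are placeholders, not a real site or our actual schema:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical XML description of a login form; all names and values are placeholders.
FORM_XML = """
<form action="https://example.com/login" method="POST">
  <field name="username">analyst01</field>
  <field name="password">secret</field>
  <field name="remember_me">1</field>
</form>
"""

form = ET.fromstring(FORM_XML)
payload = {field.get("name"): field.text for field in form.findall("field")}
data = urllib.parse.urlencode(payload).encode()

# The request is only constructed here; a playback engine would submit it
# and hand the returned HTML to the extraction step.
request = urllib.request.Request(form.get("action"), data=data, method=form.get("method"))
print(request.full_url, payload)
```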
Creation of Uniform HTML Source Schemas and XML Schema Unification
By combining HTML pattern recognition with playback/recording technology,
we produce uniform XML schemas that correspond to data contained in Internet/Intranet sites.
Our unification technology allows merging XML data from disparate sources into one uniform
XML-structured data stream and integrating extracted information with external third-party XML data.
Please see Development Tools for our unification architecture.
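As an illustration of the idea (with made-up source schemas and field names, not the schemas our unification architecture actually produces), the sketch below maps two differently structured XML feeds into one uniform stream:

```python
import xml.etree.ElementTree as ET

# Two hypothetical sites expose the same kind of listings under different schemas.
SOURCE_A = "<listings><row><addr>55 Broad St</addr><sqft>12000</sqft></row></listings>"
SOURCE_B = "<properties><property address='1 Main St' area='8000'/></properties>"

def from_source_a(xml_text):
    for row in ET.fromstring(xml_text).findall("row"):
        yield {"address": row.findtext("addr"), "square_feet": row.findtext("sqft")}

def from_source_b(xml_text):
    for prop in ET.fromstring(xml_text).findall("property"):
        yield {"address": prop.get("address"), "square_feet": prop.get("area")}

# Merge both feeds into a single uniform XML stream.
unified = ET.Element("unified_listings")
for record in list(from_source_a(SOURCE_A)) + list(from_source_b(SOURCE_B)):
    item = ET.SubElement(unified, "listing")
    for name, value in record.items():
        ET.SubElement(item, name).text = value

print(ET.tostring(unified, encoding="unicode"))
```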
Textual Content Filtration and Extraction
By combining our data extraction techniques with sophisticated filtration methods,
we are able to detect contiguous “article-like” textual content and distinguish it from
collections of links, common HTML page elements, and other irrelevant “noise.”
For example, given a news site's front page and an article page it links to, only the contiguous article text
is extracted, while the front page and the non-article elements of the article page, including inline graphics, are rejected.
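One common way to approximate this (a simplified sketch, not our actual filtration method) is to split a page into blocks and keep only long blocks with low link density:

```python
from html.parser import HTMLParser

class BlockSplitter(HTMLParser):
    """Splits a page into text blocks and records how much of each block's
    text lies inside <a> tags (its link density)."""
    BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars)
        self._parts, self._link_chars, self._in_anchor = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()
        if tag == "a":
            self._in_anchor += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self._in_anchor -= 1
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._parts.append(data)
        if self._in_anchor:
            self._link_chars += len(data)

    def flush(self):
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts, self._link_chars = [], 0

def extract_article(html, min_chars=80, max_link_density=0.25):
    """Keep long, text-heavy blocks; drop link collections and page furniture."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.flush()
    kept = [text for text, link_chars in splitter.blocks
            if len(text) >= min_chars and link_chars / len(text) <= max_link_density]
    return "\n\n".join(kept)
```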
Statistical Data Classification
Since even information located in the “right” section of
a newspaper or company site may be totally irrelevant, further content filtration should be applied to
ensure a very narrow focus for the delivered data. For example, the real estate section of The New York Times
contains mostly residential and occasional commercial real estate articles; for a commercial real estate
analyst, all residential real estate news articles are “noise” that should be filtered out. We use Naïve Bayesian
statistical algorithms and other machine learning techniques to distinguish between relevant textual content and “noise.”
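A minimal sketch of the approach (using scikit-learn and invented training snippets rather than our production models and corpora):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented labelled snippets: 1 = commercial real estate, 0 = residential "noise".
train_texts = [
    "Office tower in midtown leased to an investment bank",
    "Developer breaks ground on a retail and warehouse complex",
    "Three-bedroom co-op apartments see rising asking prices",
    "Mortgage rates squeeze first-time home buyers in the suburbs",
]
train_labels = [1, 1, 0, 0]

classifier = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
classifier.fit(train_texts, train_labels)

incoming = ["Vacancy rates fall as firms expand their downtown office leases"]
print(classifier.predict(incoming))        # e.g. [1]: keep for the commercial analyst
print(classifier.predict_proba(incoming))  # class probabilities, usable as a noise threshold
```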
Duplicate Identification
For unstructured data, we are able to speed up content processing and avoid handling
duplicate information by detecting not only exactly identical documents but also those that are nearly identical,
such as news articles that are reprinted or referenced by multiple sites.
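A minimal sketch of one standard way to do this (word shingles compared by Jaccard similarity; the shingle size and threshold are arbitrary illustrations, not our production settings):

```python
def shingles(text, k=5):
    """k-word shingles of a normalized document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Overlap between two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a or b else 1.0

def is_near_duplicate(doc_a, doc_b, threshold=0.8):
    """Flag documents whose shingle sets overlap heavily; this catches reprinted
    articles with minor edits as well as exact copies."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```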