This article presents a new methodology for extracting financial data from the Electronic Data Gathering, Analysis and Retrieval (EDGAR) database of the Securities and Exchange Commission (SEC), which contains financial information on about 68,000 companies. In documents from this database, for example 10-K or 10-Q filings, the beginning of a balance sheet or income statement for a single company and a single year is sometimes introduced by SGML tags, while the financial data itself, such as balance sheet items, is in plain ASCII format. We introduce text mining procedures to detect relevant financial data in these documents. This is accomplished by dextrapi (data extraction API), a wrapper for extracting information from any text-based source. The extracted information is then transformed into machine-understandable XML syntax, supporting quick trading decisions by stock market investors. The advantage of dextrapi over existing wrappers, for example the World-Wide Web Wrapper Factory (W4F) or the Java Extraction and Dissemination of Information (JEDI) wrapper, lies in its ability to adapt the extraction process to semistructured input, whereas most other wrappers rely on fixed data formats for extraction (e.g. extracting only from HTML documents). Furthermore, we introduce Edgar2xml, a software agent based on the dextrapi wrapper that automates the process of extracting and evaluating balance sheet data and related information from the EDGAR database. Evaluation is done on XML output that conforms to an XML schema, that is, a set of rules describing the underlying structure of the XML document.
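The general idea of locating a statement in a plain-text filing and emitting its items as XML can be illustrated with a minimal sketch. This is not dextrapi or Edgar2xml, whose interfaces are not described here. The sample text, the regular expression, the function names and the XML element names (`balanceSheet`, `item`) are all assumptions made for illustration, and real filings vary far more than this fixed pattern allows.

```python
import re
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical fragment shaped like a plain-text filing; not real EDGAR data.
SAMPLE = """\
<TABLE>
<CAPTION>
BALANCE SHEET
Total current assets ........ 1,200
Total assets                  3,450
Total liabilities            (2,000)
</TABLE>
"""

# Assumed line layout: a label, then filler (two or more spaces or dots),
# then a number at the end of the line. Parentheses mark a negative amount.
ITEM = re.compile(
    r"^(?P<label>[A-Za-z][A-Za-z ,&'-]*?)[ .]{2,}"
    r"(?P<value>\(?\d[\d,]*(?:\.\d+)?\)?)\s*$"
)


def parse_amount(text):
    """Convert '1,200' or '(2,000)' into a Decimal."""
    negative = text.startswith("(") and text.endswith(")")
    number = Decimal(text.strip("()").replace(",", ""))
    return -number if negative else number


def extract_items(filing_text):
    """Collect (label, amount) pairs between the heading and </TABLE>."""
    items = []
    in_statement = False
    for line in filing_text.splitlines():
        stripped = line.strip()
        if "BALANCE SHEET" in stripped.upper():
            in_statement = True
            continue
        if not in_statement:
            continue
        if stripped.upper() == "</TABLE>":
            break
        match = ITEM.match(stripped)
        if match:
            items.append((match.group("label").strip(),
                          parse_amount(match.group("value"))))
    return items


def to_xml(items):
    """Serialize the extracted items using hypothetical element names."""
    root = ET.Element("balanceSheet")
    for label, amount in items:
        element = ET.SubElement(root, "item", name=label)
        element.text = str(amount)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(extract_items(SAMPLE)))
```

A fixed pattern like this is exactly the kind of rigid format assumption that the abstract contrasts with an adaptive wrapper, so it should be read only as a baseline for what the extraction step has to achieve.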
Soon after the invention of the World-Wide Web, with its terabytes of information and millions of Web sites, all the role players in the digital information environment realised how difficult it was not only to find relevant and precise information via the Web, but also to stay abreast of the megabytes of information being added to these Web sites on a daily basis. This article is an effort to categorize the various approaches that have been developed to date to assist end-users in identifying and evaluating recently published information. Approaches such as Web casting, Web personalization, collaborative filtering, Web tracking services and more are identified, described and illustrated by a number of screen displays.
A Web-based expert system shell for the identification problem is discussed. The article briefly explains the identification problem and the solution that was implemented. A feature of the expert system shell is the ability to change the user interface at will, and the article discusses how this was implemented in the system. Lastly, the ability to link expert systems together into a network is briefly discussed and some possible applications are named.