South African Journal of Information Management - Automatic extraction and analysis of financial data from the EDGAR database



In this article the authors discuss a new methodology for extracting financial data from the Electronic Data Gathering, Analysis and Retrieval (EDGAR) database of the Securities and Exchange Commission (SEC), which contains financial information on about 68,000 companies. In the documents of this database, for example 10-K or 10-Q filings, the beginning of a balance sheet or income statement for a single company and a single year is sometimes marked by SGML tags, while the financial data itself, such as balance sheet items, is in pure ASCII format. We introduce text mining procedures to detect relevant financial data in these documents. This is accomplished by dextrapi (data extraction API), a wrapper for extracting information from any text-based source. The extracted information is then transformed into machine-understandable XML syntax, which supports quick trading decisions by stock market investors. The advantage of dextrapi over existing wrappers, for example the World-Wide Web Wrapper Factory (W4F) or the Java Extraction and Dissemination of Information (JEDI) wrapper, lies in its ability to adapt the extraction process to semi-structured input, whereas most other wrappers rely on fixed data formats (e.g. extracting only from HTML documents). Furthermore, we introduce Edgar2xml, a software agent based on the dextrapi wrapper that automates the process of extracting and evaluating balance sheet data and related information from the EDGAR database. Evaluation is performed on XML output that conforms to an XML schema, that is, a set of rules describing the underlying structure of the XML document.
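To make the idea of turning ASCII balance sheet lines into XML concrete, the following is a minimal Python sketch. It is not the dextrapi or Edgar2xml implementation, whose internals are not described here. The sample text, the function names (`parse_amount`, `extract_items`, `to_xml`) and the XML element names (`balanceSheet`, `item`, `value`) are all assumptions made for illustration. The sketch assumes each line item consists of a label followed by amount columns separated by two or more spaces, with parentheses denoting negative values.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the style of an ASCII table from a filing;
# the figures are invented for illustration only.
SAMPLE = """<TABLE>
<CAPTION>
CONSOLIDATED BALANCE SHEET
                                   1999        1998
<S>                                <C>         <C>
Cash and cash equivalents        12,345      10,200
Accounts receivable, net          8,100       7,950
Retained earnings (deficit)      (1,250)        300
Total assets                     20,445      18,150
</TABLE>
"""

# An amount such as 12,345 or (1,250) or $300.50 (assumed column format).
AMOUNT_RE = re.compile(r"^\(?\$?\d[\d,]*(?:\.\d+)?\)?$")


def parse_amount(token):
    """Convert an amount token to a float; parentheses mean negative."""
    negative = token.startswith("(") and token.endswith(")")
    digits = token.strip("()").replace("$", "").replace(",", "")
    value = float(digits)
    return -value if negative else value


def extract_items(text):
    """Return (label, [amounts]) pairs for lines that look like line items."""
    items = []
    for line in text.splitlines():
        # Columns are assumed to be separated by at least two spaces.
        fields = re.split(r"\s{2,}", line.strip())
        # Skip tags, captions and year headers: a line item needs a label
        # that starts with a letter plus at least one amount column.
        if len(fields) < 2 or not fields[0][:1].isalpha():
            continue
        label, amounts = fields[0], fields[1:]
        if all(AMOUNT_RE.match(a) for a in amounts):
            items.append((label, [parse_amount(a) for a in amounts]))
    return items


def to_xml(items, periods):
    """Serialize extracted items to XML (element names are illustrative)."""
    root = ET.Element("balanceSheet")
    for label, amounts in items:
        item = ET.SubElement(root, "item", name=label)
        for period, amount in zip(periods, amounts):
            ET.SubElement(item, "value", period=period).text = f"{amount:.2f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(extract_items(SAMPLE), ["1999", "1998"]))
```

A fixed column rule like this is exactly the kind of rigid format assumption the abstract contrasts with an adaptive wrapper; real filings vary in layout, so the sketch only shows the input and output shapes involved.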

