MIEX (Metadata and Information Extractor from small XML documents) aims to provide a wrapper for the Stanford Parser, developed by the Stanford Natural Language Processing Group. It extracts and stores metadata (syntactic structures, relationships among words...) from simple XML documents so that information retrieval (IR) methodologies can later be applied to retrieve useful information.
MIEX is being developed as my final year project at the University of Oviedo. The application analyses a batch of XML documents and navigates through the Stanford Parser's output trees to store all the extracted semantic information in a MySQL database.
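As a rough illustration of the tree navigation step, the sketch below reads a Penn Treebank style bracketed parse and stores one row per constituent. It is not MIEX's own code: the sample parse string is hand-written rather than real parser output, the `constituents` table and all function names are hypothetical, and an in-memory SQLite database stands in for MySQL so the example stays self-contained.

```python
import sqlite3

def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples.
    Leaves (words) are kept as plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return read_node()

def leaves(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words

def constituents(node, depth=0):
    """Yield (depth, label, words) for every labelled node, in pre-order."""
    if isinstance(node, str):
        return
    yield depth, node[0], " ".join(leaves(node))
    for child in node[1]:
        yield from constituents(child, depth + 1)

def store(tree, conn):
    """Write one row per constituent into a hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS constituents "
        "(depth INTEGER, label TEXT, words TEXT)"
    )
    conn.executemany(
        "INSERT INTO constituents VALUES (?, ?, ?)", list(constituents(tree))
    )
    conn.commit()

# Hand-written sample parse (an assumption, not actual Stanford Parser output).
SAMPLE = ("(ROOT (S (NP (NNP Debian)) "
          "(VP (VBZ uses) (NP (DT the) (NNP Linux) (NN kernel)))))")

conn = sqlite3.connect(":memory:")    # SQLite stands in for MySQL here
store(parse_bracketed(SAMPLE), conn)
noun_phrases = [
    row[0]
    for row in conn.execute(
        "SELECT words FROM constituents WHERE label = 'NP' ORDER BY depth"
    )
]
```

Once the constituents are in a table, a query such as the one for `NP` rows is enough to pull out every noun phrase, which is the kind of later retrieval the stored metadata is meant to support.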
This project was mainly developed to process small collections with this structure:
<topic> <d>debian</d> <d>linux</d> <d>OS</d> </topic>
<body> Debian is a free operating system (OS) for your computer. An operating system is the set of basic programs and utilities that make your computer run. Debian uses the Linux kernel (the core of an operating system), but most of the basic OS tools come from the GNU project; hence the name GNU/Linux </body>
<topic> <d>computing</d> <d>search</d> <d>engine</d> </topic>
<body> Google Inc. (NASDAQ: GOOG and LSE: GGEA) is an American public corporation, specializing in Internet search and online advertising. The company had 10,674 full-time employees as of December 31, 2006, and is based in Mountain View, California. </body>
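A minimal sketch of reading one such document with Python's standard library follows. It assumes each `<topic>`/`<body>` pair is wrapped in a single root element; the `<doc>` wrapper and the `read_document` helper are hypothetical, since the excerpt above does not show a root element or MIEX's actual reader.

```python
import xml.etree.ElementTree as ET

# One document in the structure shown above, wrapped in an assumed <doc> root.
SAMPLE = """<doc>
<topic> <d>debian</d> <d>linux</d> <d>OS</d> </topic>
<body> Debian is a free operating system (OS) for your computer. </body>
</doc>"""

def read_document(xml_text):
    """Return (keywords, body_text) for a single document."""
    root = ET.fromstring(xml_text)
    # Each <d> element under <topic> carries one keyword.
    keywords = [d.text.strip() for d in root.findall("./topic/d") if d.text]
    # The <body> text is what would be handed to the parser.
    body = (root.findtext("body") or "").strip()
    return keywords, body

keywords, body = read_document(SAMPLE)
```

Keeping the keywords and the body text separate mirrors the structure of the input: the `<d>` entries are already discrete terms, while only the `<body>` text needs syntactic analysis.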
A configuration mechanism is planned to tell MIEX which fields are suitable for extracting semantic information. In addition, XML input files are validated against an XML schema supplied by the end user. Let's see how MIEX works.
For more information, as well as development and end-user tools (documentation, Git, feature requests, project status...), see the project page at SourceForge.
The following people are involved in this project.
To get the latest changes and features, it is recommended to grab a fresh snapshot from the Git repository:
Please note that SF.net svn/git repositories are no longer being used.
MIEX is licensed under the GPLv2. It also relies on a set of third-party libraries, detailed in the following list.