Friday, May 9, 2014

Automated HTML Format E-Book Concordance Generator

I have started a new programming project: an automatic concordance generator for HTML-based E-Books. Given a list of terms, it searches an HTML document and inserts a local hyperlink wherever it finds a term. When it is done, it returns a list of all the local hyperlinks as a second HTML document, to be used as the starting point for a hand-edited index. I created it to speed up the process of indexing an E-Book; I found a free alternative program for PDF files, but none for HTML.
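The basic pass can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: the function name `build_concordance`, the `idx-` anchor ids, and the `book_href` parameter are all made up for the example. It splits the document into tags and text with a regular expression, so it never touches tag attributes, but it does not skip the head or title, and a real HTML parser would be more robust.

```python
import html
import re

def build_concordance(document, terms, book_href=""):
    """Return (marked_document, index_html) for an HTML string.

    Hypothetical sketch: book_href is the file name the index should
    link to; leave it empty if the index lives in the same file.
    """
    if not terms:
        return document, "<ul>\n</ul>"

    # One case-insensitive pattern for all terms; longer terms come first
    # so that "index entry" wins over "index" when both are listed.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    lookup = {t.lower(): t for t in terms}
    anchors = {t: [] for t in terms}

    def mark(match):
        # Put an empty anchor in front of the match and remember its id.
        term = lookup[match.group(0).lower()]
        slug = re.sub(r"\W+", "-", term.lower())
        anchor_id = "idx-%s-%d" % (slug, len(anchors[term]) + 1)
        anchors[term].append(anchor_id)
        return '<a id="%s"></a>%s' % (anchor_id, match.group(0))

    # re.split with a capturing group puts the tags at odd positions,
    # so only the text between tags (even positions) is searched.
    parts = re.split(r"(<[^>]*>)", document)
    marked = "".join(
        part if i % 2 else pattern.sub(mark, part)
        for i, part in enumerate(parts)
    )

    # Build the starting-point index: one list item per term that was found.
    lines = ["<ul>"]
    for term in sorted(anchors, key=str.lower):
        if anchors[term]:
            links = ", ".join(
                '<a href="%s#%s">%d</a>' % (book_href, anchor_id, n)
                for n, anchor_id in enumerate(anchors[term], 1)
            )
            lines.append("<li>%s: %s</li>" % (html.escape(term), links))
    lines.append("</ul>")
    return marked, "\n".join(lines)
```

Calling `build_concordance(text, ["index", "hyperlink"], book_href="book.html")` would give back the marked-up book and a bare list of links to edit by hand.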

It works as intended at its most basic level, but the design needs more refinement. For example, when two consecutive paragraphs contain the same term, the program finds and marks both, which is an unintended result. An index entry does not need multiple links to the same page, as they bloat the index and confuse users.
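One possible way to handle this is to filter the hits after the search, before any links are written. The sketch below is an assumption about how that could look, not something the program does today: `suppress_nearby`, the `(term, paragraph_index)` hit format, and the `min_gap` threshold are all hypothetical.

```python
def suppress_nearby(hits, min_gap=2):
    """Drop hits that fall too close to the previous hit for the same term.

    hits is a list of (term, paragraph_index) pairs in document order.
    A hit is kept only if the same term was last seen at least min_gap
    paragraphs earlier, so a run of neighbouring paragraphs that all use
    the term collapses into a single entry at the start of the run.
    """
    last_seen = {}
    kept = []
    for term, paragraph in hits:
        previous = last_seen.get(term)
        if previous is None or paragraph - previous >= min_gap:
            kept.append((term, paragraph))
        # Track the last occurrence, kept or not, so long runs stay merged.
        last_seen[term] = paragraph
    return kept
```

Comparing against the last occurrence rather than the last kept hit is a design choice: a discussion that runs over paragraphs 3, 4, 5 and 6 gets one entry at paragraph 3 instead of one every other paragraph.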

Noted issues and areas for improvement:
  • Unnecessary index entries from terms that occur in close proximity.
  • No method of specifying areas to skip concordance generation.
  • Duplicate search terms with different punctuation get treated as different terms, leading to duplicated results.
  • Performance is terrible because of the recursive nature of the current search algorithm. My 150-page test document took 5 minutes on average. Stripping the header and title HTML (which leaves the body content untouched) cut that to an average of 1 minute 40 seconds, one third (33.3%) of the original run time, or a 66.7% reduction. I am confident performance can be improved further.
  • I still have not implemented multi-threading in my applications, so they appear unresponsive while working on a long operation.
  • There is no way to open a document with a preexisting index and edit, add to, or merge with the existing results.
  • Ideally it would have a graphical user interface where editors could quickly browse the document, manually add index entries alongside those generated by the concordance, and manage the data dynamically.
  • Ideally it would have dictionary support: categorizing terms as nouns or verbs, singular or plural, and so on, then comparing them against words of similar meaning to make smarter matches.
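The duplicate-term issue in the list above is probably the cheapest one to fix. A minimal sketch in Python, assuming the term list is cleaned before the search starts: the names `normalize_term` and `dedupe_terms` are hypothetical, and only case, whitespace, and punctuation around a term are handled, so variants that differ inside the word (such as "E-Book" and "EBook") would still count as different terms.

```python
import string

def normalize_term(term):
    """Return a comparison key: no surrounding punctuation, folded case."""
    cleaned = term.strip(string.punctuation + string.whitespace)
    # Collapse runs of internal whitespace to a single space.
    return " ".join(cleaned.casefold().split())

def dedupe_terms(terms):
    """Keep the first spelling of each term, comparing by normalized key."""
    seen = {}
    for term in terms:
        key = normalize_term(term)
        if key and key not in seen:
            seen[key] = term.strip()
    return list(seen.values())
```

Searching once per normalized term, instead of once per spelling, would also cut down the number of passes over the document.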

In short, I am trying to add more levels of smarter automation to save editors time and effort.

A pipe dream would be to create an artificial intelligence capable of more than concordance: one that recognizes the ideas conveyed in a text and creates a 100% machine-generated index rather than a concordance. Gathering the massive amount of raw data needed for word identification and networking, and developing the pattern recognition algorithms, would be huge endeavors in themselves, but not impossible. Perhaps they are worth pursuing when time and money are more abundant.
