Sunday, November 30, 2008

Secret Research Project 3: Authorship Attributor

During the term, I developed a program that was able to attribute a document to its correct author. In other words, given a sample document, the trained program would be able to indicate who wrote it.

The theoretical basis for the authorship attributor is that different authors have different active vocabularies and different word preferences. Hence, this difference can be exploited to differentiate between writings of different authors, as long as enough of their prior work is known.
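As a rough illustration of the word-preference idea, the sketch below computes relative word frequencies for two short texts and lists the words whose usage differs most between them. The function names and the two toy samples are hypothetical and are not taken from the actual program or corpus.

```python
import re
from collections import Counter

def relative_frequencies(text):
    """Map each word to its share of all word tokens in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()} if total else {}

def most_distinctive(text_a, text_b, top=3):
    """Return the words whose relative frequency differs most between two texts."""
    fa, fb = relative_frequencies(text_a), relative_frequencies(text_b)
    gap = {w: fa.get(w, 0.0) - fb.get(w, 0.0) for w in set(fa) | set(fb)}
    # Largest absolute gap first; alphabetical order breaks exact ties.
    return sorted(gap, key=lambda w: (-abs(gap[w]), w))[:top]

# Hypothetical toy samples, invented for illustration only.
sample_a = "upon the whole, the plan rests upon the consent of the people"
sample_b = "while the plan is sound, the states may act while they wait"

print(most_distinctive(sample_a, sample_b))
```

With enough known text per author, gaps like these become stable enough to serve as features for a classifier.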

The program employed genetic algorithms to evolve simple rule-based classifier systems. The final performance on the test set, the Federalist Papers, was generally good (an average of 91.67% correct for the best parameter setting in the table below), although slightly inferior to existing methods.
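The post does not spell out how the rule systems were encoded, so the following is only a minimal sketch of the general idea under one assumed encoding: each individual is a bitmask that switches candidate rules on or off, and fitness is the accuracy of a majority vote among the active rules. The vote table, labels, parameter names and defaults are all hypothetical toy values, not the settings used in the project.

```python
import random

# Hypothetical precomputed votes: VOTES[r][d] is the author (0 or 1) that
# candidate rule r votes for on training document d. Toy data only.
VOTES = [
    [0, 0, 1, 1, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1, 0],
]
LABELS = [0, 0, 1, 1, 0, 1]  # assumed true author of each training document

def fitness(mask):
    """Fraction of documents the active rules classify correctly by majority vote."""
    active = [votes for bit, votes in zip(mask, VOTES) if bit]
    if not active:
        return 0.0
    correct = 0
    for d, label in enumerate(LABELS):
        ones = sum(votes[d] for votes in active)
        predicted = 1 if 2 * ones > len(active) else 0  # ties go to author 0
        correct += predicted == label
    return correct / len(LABELS)

def evolve(pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Evolve a rule bitmask with elitism, one-point crossover and bit-flip mutation."""
    rng = random.Random(seed)
    n = len(VOTES)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        next_gen = population[:2]  # elitism: carry the two best forward unchanged
        while len(next_gen) < pop_size:
            # Parents are drawn from the fitter half of the population.
            a, b = rng.sample(population[: pop_size // 2], 2)
            cut = rng.randint(1, n - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best), sum(best))  # mask, training accuracy, number of active rules
```

Under this assumed encoding, the number of set bits in the best mask plays the role of the "active rules" count reported in the results table.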

P     L    M     C    Average % Correct   Average No. of Active Rules
100   25   100   5    91.67               4
50    25   100   5    78.33               5
100   10   100   5    88.33               2
100   25   10    5    73.33               11.2
100   25   100   25   83.33               4.6
100   25   10    25   78.33               7.6
Table: Performance of Classifiers for Given GA Parameters

The key weakness of my proposed approach is that it is too simple, employing only a voting framework of rules. Furthermore, the individual rules consider only the relative frequencies of pairs of words, an approach that does not generalize. In other words, while the current approach may be useful for a pairwise classifier, it is likely to be useless for developing a universal attributor capable of differentiating between any number of candidate authors.
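To make the limitation concrete, here is a minimal sketch of what a voting framework over word-pair rules might look like. The rule format, the word pairs and the author labels "A" and "B" are assumptions made for illustration; they are not the rules the program actually evolved.

```python
import re
from collections import Counter

def word_share(text):
    """Relative frequency of each word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1
    return {w: c / total for w, c in Counter(words).items()}

# Each hypothetical rule: (word_x, word_y, author if x is more frequent, author otherwise).
RULES = [
    ("upon", "while", "A", "B"),
    ("enough", "whilst", "A", "B"),
    ("there", "on", "A", "B"),
]

def classify(text, rules=RULES):
    """Let every rule vote for one author and return the majority choice."""
    share = word_share(text)
    votes = Counter()
    for word_x, word_y, if_more, otherwise in rules:
        fx, fy = share.get(word_x, 0.0), share.get(word_y, 0.0)
        if fx == fy:
            continue  # the rule abstains when the two frequencies are equal
        votes[if_more if fx > fy else otherwise] += 1
    return votes.most_common(1)[0][0] if votes else None

print(classify("upon upon while"))
print(classify("whilst on the way"))
```

Each rule can only ever name one of two authors, which is why this form suits a pairwise classifier but does not extend directly to an arbitrary number of candidates.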

Still, I believe that this research project has more room for improvement than my previous ones. In particular, the rule-combining system can be improved beyond the voting framework currently used. Furthermore, the rules themselves can be generalized with little modification. Additional criteria could also be included.

I hope to revisit this problem at a later point in time.

1 comment:

Will Dwinnell said...

That is a cool project!