During the term, I developed a program that attributes a document to its author. In other words, given a sample document, the trained program indicates which of the candidate authors wrote it.
The theoretical basis for the authorship attributor is that different authors have different active vocabularies and different word preferences. Hence, this difference can be exploited to differentiate between writings of different authors, as long as enough of their prior work is known.
The program employed genetic algorithms to evolve simple rule-based classifier systems. The final performance on the test set, the Federalist Papers, was generally good, although slightly inferior to existing methods.
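To make the idea of evolving a rule-based classifier more concrete, here is a minimal sketch of a genetic algorithm over small rule sets. The rule encoding, the fitness function (training accuracy), tournament selection, one-point crossover and the parameter names (`pop_size`, `n_rules`, `generations`, `mutation_rate`) are all assumptions made for illustration. They are not the encoding or operators used in the actual project, and they are not meant to correspond to the parameters in the results table.

```python
import random

def word_freqs(text):
    # Relative frequency of each word in a whitespace-tokenised text.
    words = text.lower().split()
    return {w: words.count(w) / len(words) for w in set(words)} if words else {}

def predict(rules, freqs, authors):
    # Hypothetical rule form: (word_a, word_b, flip). A rule votes for
    # authors[0] when word_a is more frequent than word_b, unless flip
    # reverses it. Ties in the total go to authors[0].
    score = 0
    for word_a, word_b, flip in rules:
        a_higher = freqs.get(word_a, 0.0) > freqs.get(word_b, 0.0)
        score += 1 if a_higher != flip else -1
    return authors[0] if score >= 0 else authors[1]

def fitness(rules, samples, authors):
    # Fraction of training samples (freqs, label) classified correctly.
    correct = sum(predict(rules, f, authors) == label for f, label in samples)
    return correct / len(samples)

def random_rule(vocab, rng):
    word_a, word_b = rng.sample(vocab, 2)
    return (word_a, word_b, rng.random() < 0.5)

def evolve(samples, authors, vocab, pop_size=20, n_rules=5,
           generations=30, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    population = [[random_rule(vocab, rng) for _ in range(n_rules)]
                  for _ in range(pop_size)]

    def score(individual):
        return fitness(individual, samples, authors)

    for _ in range(generations):
        population.sort(key=score, reverse=True)
        next_gen = [population[0]]  # elitism: keep the current best
        while len(next_gen) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            mom = max(rng.sample(population, 3), key=score)
            dad = max(rng.sample(population, 3), key=score)
            # One-point crossover on the rule list.
            cut = rng.randrange(1, n_rules)
            child = mom[:cut] + dad[cut:]
            # Mutation: occasionally replace a rule with a random one.
            child = [random_rule(vocab, rng) if rng.random() < mutation_rate
                     else rule for rule in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=score)
```

A call such as `evolve(samples, ("A", "B"), vocab)` returns the best rule set found, where `samples` is a list of `(word_freqs(text), author)` pairs and `vocab` is a list of candidate words. In practice, fitness would be measured on held-out documents rather than on the training samples alone.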
P | L | M | C | Average % Correctly Classified | Average No. of Active Rules |
100 | 25 | 100 | 5 | 91.67 | 4 |
50 | 25 | 100 | 5 | 78.33 | 5 |
100 | 10 | 100 | 5 | 88.33 | 2 |
100 | 25 | 10 | 5 | 73.33 | 11.2 |
100 | 25 | 100 | 25 | 83.33 | 4.6 |
100 | 25 | 10 | 25 | 78.33 | 7.6 |
The key weakness of my approach is that it is too simple: it relies only on a voting framework of rules. Furthermore, each rule considers only the relative frequencies of a pair of words, which does not generalize well. In other words, while the existing approach may be useful as a pairwise classifier, it is unlikely to be useful for developing a universal attributor capable of differentiating between any number of candidate authors.
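As a rough illustration of what a voting framework over word-pair rules can look like, here is a small sketch. The rule format `(word_a, word_b, author_if_a_higher, author_otherwise)`, the function names and the example words are hypothetical and chosen only for illustration. They are not taken from the actual program.

```python
import re
from collections import Counter

def relative_frequencies(text):
    # Lowercase the text, split it into words, and normalise the counts.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}

def rule_vote(freqs, word_a, word_b, author_if_a_higher, author_otherwise):
    # A rule compares the relative frequencies of two words and votes
    # for one of two authors. It abstains (returns None) on a tie.
    freq_a = freqs.get(word_a, 0.0)
    freq_b = freqs.get(word_b, 0.0)
    if freq_a == freq_b:
        return None
    return author_if_a_higher if freq_a > freq_b else author_otherwise

def classify(text, rules):
    # Simple majority vote over all rules that do not abstain.
    freqs = relative_frequencies(text)
    votes = Counter()
    for rule in rules:
        vote = rule_vote(freqs, *rule)
        if vote is not None:
            votes[vote] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Illustrative rules only; the word choices are assumptions.
rules = [
    ("upon", "while", "Hamilton", "Madison"),
    ("whilst", "while", "Madison", "Hamilton"),
]
```

Because every rule can only choose between two authors, a sketch like this shows why the scheme works as a pairwise classifier but does not extend directly to an arbitrary number of candidates.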
Still, I believe that this research project has more room for improvement than my previous ones. In particular, the rule-combining system could be improved beyond the simple voting framework currently used. Furthermore, the rules themselves could be generalized with little modification, and additional criteria could be included.
I hope to revisit this problem at a later point in time.