Please see the master's thesis for detailed information about the classifier. The background chapter also provides a readable introduction to text classification.

To use the service, paste a list of PubMed IDs into the input box on the "submit a task" page. Each advanced option has a Help button, but the defaults should be fine. PubMed is a search engine for Medline. MScanner is a classifier: it learns what a topic looks like from the examples provided, and uses that to distinguish relevant Medline records from irrelevant ones. This is useful for curating biomedical databases (where hundreds of examples are already available, having been used to curate existing database entries) and for expanding collections of citations.

The more examples the better. In fact, using only one example will often return zero results: a single example specifies a very narrow topic, and it is unlikely that anything in Medline shares enough features with it to receive a positive score. If you have just one example, we recommend using the Related Articles feature on PubMed instead, which finds the most similar records in PubMed. You may, however, lower the score threshold from zero to -500 so that the classifier also returns negative-scoring results.
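The effect of the score threshold can be illustrated with a minimal sketch. This is not MScanner's code: the function name `filter_by_threshold`, the example PubMed IDs and scores, and the choice to keep records scoring exactly at the threshold are all assumptions made for illustration.

```python
def filter_by_threshold(scored_records, threshold=0.0):
    """Keep (pmid, score) pairs scoring at or above the threshold, best first.

    Hypothetical helper: whether a record scoring exactly at the
    threshold is kept is an assumption, not documented behaviour.
    """
    kept = [(pmid, score) for pmid, score in scored_records if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up scores for three records, as a classifier might assign them.
scored = [(11111111, 12.5), (22222222, -3.2), (33333333, -120.0)]

default_hits = filter_by_threshold(scored)               # threshold of zero
all_hits = filter_by_threshold(scored, threshold=-500)   # lowered threshold

print(len(default_hits), len(all_hits))
```

With the default threshold only the positive-scoring record survives; lowering the threshold to -500 lets the negative-scoring records through as well, which is why it can help when a single example yields no results.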

After clicking Submit, you will be taken to the status page, where you can monitor the progress of the task until a link to the results appears on the output page. Filtering Medline takes roughly 90 seconds when using MeSH terms, and about 3 minutes if title/abstract words are included. At the bottom of the page, you can see results for the filtering and cross validation runs in the MScanner publication. To evaluate MScanner's performance using cross validation, you will need at least 30 PubMed IDs (see results on sample topics below).
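Before submitting, it can help to check a list of PubMed IDs locally. The sketch below is an assumption-laden illustration, not part of MScanner: the helper names `parse_pubmed_ids` and `enough_for_cross_validation` are hypothetical, and the accepted separators (whitespace, commas, semicolons) are a guess at what a pasted list might contain. Only the minimum of 30 IDs for cross validation comes from the text above.

```python
import re

MIN_IDS_FOR_CROSS_VALIDATION = 30  # minimum stated on this page


def parse_pubmed_ids(text):
    """Parse pasted text into a list of unique PubMed IDs, in input order.

    Hypothetical helper: the separators accepted here are an assumption
    about the pasted format, not MScanner's documented input rules.
    """
    ids = []
    seen = set()
    for token in re.split(r"[\s,;]+", text.strip()):
        if not token:
            continue
        if not re.fullmatch(r"[0-9]+", token):
            raise ValueError(f"not a PubMed ID: {token!r}")
        pmid = int(token)
        if pmid not in seen:  # drop duplicates, keep first occurrence
            seen.add(pmid)
            ids.append(pmid)
    return ids


def enough_for_cross_validation(ids):
    """Return True if the list is large enough for cross validation."""
    return len(ids) >= MIN_IDS_FOR_CROSS_VALIDATION


pasted = "123\n456, 123"
ids = parse_pubmed_ids(pasted)
print(ids, enough_for_cross_validation(ids))
```

Deduplicating first matters for the size check, since repeated IDs do not add training examples.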

Download the latest source code archive. HTML documentation of the MScanner API is included. MScanner is released under the GNU General Public License.

This site was tested with Mozilla Firefox 2, Internet Explorer 7, Safari 3 and Opera 9. Some features may not work in older browsers or with JavaScript disabled. Please contact us if you have problems.

Below are results for the sample topics published in the MScanner paper in BMC Bioinformatics in February 2008.

Training data | Results | Results
PG07 (Pharmacogenetics, 1595 citations) | retrieval | validation
AIDSBio (AIDS and Bioethics, 10732 citations) | retrieval | validation
Radiology (splenic imaging, 67 citations) | retrieval | validation