| Parameter | Value | Description |
|---|---|---|
| Timestamp | 2009/11/03 16:30:37 GMT | Date and time at which the query was submitted. |
| Feature score table | terms.csv | CSV spreadsheet detailing the calculation of the feature support scores. |
| Relevant PubMed IDs | positives.txt | List of PubMed IDs of the relevant training examples. Dividing the file into 10 parts yields the cross validation folds. |
| Irrelevant PubMed IDs | negatives.txt | List of PubMed IDs of the irrelevant examples (randomly sampled from Medline). Dividing the file into 10 parts yields the cross validation folds. |
| Feature score method | scores_laplace_split | Name of the method used to calculate feature scores. Docstring for the method: For feature probabilities we use a Laplace prior of 1 success and 1 failure in total, split between the classes according to size. This avoids problems with class skew. |
| Number of folds | 10 | Number of partitions into which the relevant and irrelevant data sets were split. |
| Prior score | -4.62335415642 | The log ratio of relevant to irrelevant articles in the cross validation data. This prior log ratio is added to log likelihood ratios to obtain posterior article scores. |
| Base score | -96.4007276657 | The log likelihood ratio of an empty article (one in which every feature failed to occur). |
| Min Document Frequency | 0 | Minimum Document Frequency. In each fold, we select features having at least this many occurrences in the training corpus. |
| Min Information Gain | 2e-05 | Minimum Information Gain. In each fold, we select features having at least this relative information gain (information gain divided by the entropy of the original class variable). |
| Random Seed | None | Random seed for shuffling the data. If None, the random seed is set using the system clock. |
| Score threshold | 0.220 | If an article has a score greater than or equal to this value, classify it as relevant. The threshold is either the lowest one >= 0, or may be chosen to obtain break-even, maximum F measure, or maximum utility. |
| Average Precision | 0.98951 | Precision averaged over all ranks where a relevant article is retrieved. |
| Break-Even (precision=recall) | 0.949 | Shared value at the point where Recall = Precision = F1-measure. Typically the F1-Measure at break-even is slightly lower than the maximum F1-Measure. |
| Area under ROC curve (AUC) | 0.99974 | Area under the graph of the true positive rate versus false positive rate. Equals the probability that a randomly selected relevant article will be ranked above a randomly selected irrelevant article. |
| Standard Error of AUC | 0.00059 | Standard error of the area under the ROC curve. Calculated using the method of Hanley (1982). |
| 11-point precision | | Precision at recall equal to 0, 0.1, ... 1.0 |

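
The descriptions of AUC and average precision above can be made concrete with a small sketch. The scores below are hypothetical toy values, not data from this report, and the function names are illustrative only. Counting ties as half a win in the AUC is an assumption; the report does not say how ties are handled.

```python
def auc_by_pairs(relevant_scores, irrelevant_scores):
    """Fraction of (relevant, irrelevant) pairs in which the relevant
    article scores higher. Ties count as half (an assumption)."""
    wins = 0.0
    for r in relevant_scores:
        for i in irrelevant_scores:
            if r > i:
                wins += 1.0
            elif r == i:
                wins += 0.5
    return wins / (len(relevant_scores) * len(irrelevant_scores))


def average_precision(relevant_scores, irrelevant_scores):
    """Mean of the precision values measured at each rank where a
    relevant article is retrieved, ranking by decreasing score."""
    ranked = sorted(
        [(s, True) for s in relevant_scores]
        + [(s, False) for s in irrelevant_scores],
        key=lambda pair: pair[0],
        reverse=True,
    )
    hits = 0
    precisions = []
    for rank, (_, is_relevant) in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Hypothetical toy scores, for illustration only.
relevant = [5.0, 3.0, 1.0]
irrelevant = [4.0, 0.0, -2.0, -3.0]

auc = auc_by_pairs(relevant, irrelevant)
ap = average_precision(relevant, irrelevant)
print(f"AUC = {auc:.4f}, average precision = {ap:.4f}")
```

The pairwise loop is quadratic in the number of articles, so it is only suitable for illustration; a rank-based computation would be needed for 50490 articles.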
The columns of the confusion matrix are actual categories of the documents, and the rows are the predicted categories.

| | | Actual: Relevant | Actual: Irrelevant | Totals | Rates |
|---|---|---|---|---|---|
| Predicted | Relevant' | TP=489 | FP=146 | P'=635 | PPV=0.77 |
| Predicted | Irrelevant' | FN=1 | TN=49854 | N'=49855 | NPV=0.99998 |
| Totals | | P=490 | N=50000 | 50490 | Prev=0.00970 |
| Rates | | TPR=1.00 | FPR=0.00292 | Acc=0.99709 | |
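
As a check on the margins of the confusion matrix, the following sketch recomputes the totals and rates from the four cell counts reported above. The variable names are illustrative and not taken from the tool that produced this report.

```python
# Cell counts copied from the confusion matrix above.
TP, FP, FN, TN = 489, 146, 1, 49854

P = TP + FN        # actual relevant articles
N = FP + TN        # actual irrelevant articles
P_pred = TP + FP   # predicted relevant (P')
N_pred = FN + TN   # predicted irrelevant (N')
total = P + N

rates = {
    "PPV": TP / P_pred,          # precision
    "NPV": TN / N_pred,          # negative predictive value
    "TPR": TP / P,               # recall / sensitivity
    "FPR": FP / N,               # false positive rate
    "Acc": (TP + TN) / total,    # accuracy
    "Prev": P / total,           # prevalence
}

for name, value in rates.items():
    print(f"{name} = {value:.5f}")
```

At five decimal places TPR comes out as 0.99796, which agrees with the recall of 0.998 listed in the metrics; the matrix shows it rounded to two decimals as 1.00.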

| Measure | Value | Description |
|---|---|---|
| Precision (PPV) π=TP/(TP+FP) | 0.770 (0.721 to 0.831) | Proportion of predicted positives which are true positives. |
| Recall (True Positive Rate / Sensitivity) ρ=TP/(TP+FN) | 0.998 (0.980 to 1.000) | Proportion of positives which were correctly predicted to be positive. |
| F1-Measure (α=0.5) (2*ρ*π/(ρ+π)) | 0.869 (0.835 to 0.907) | Harmonic mean of recall and precision at the threshold corresponding to the maximum α-weighted F-Measure. |
| F-Measure (α=0.5) (1/(α/π+(1-α)/ρ)) | 0.869 (0.835 to 0.907) | The F measure evaluated using the given α. 0 <= α <= 1 controls the weight of precision. When α=0.5, F=F1. |
| Maximum possible F1-Measure | 0.953 | The F1-Measure that would be achieved if we had set α=0.5. |

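
The α-weighted F-Measure formula quoted above, 1/(α/π+(1-α)/ρ), can be evaluated directly from the confusion matrix counts. This is a minimal sketch with an illustrative function name; it only confirms that α=0.5 gives the harmonic mean 2*ρ*π/(ρ+π).

```python
def f_measure(precision, recall, alpha=0.5):
    """Alpha-weighted F-Measure: 1 / (alpha/precision + (1-alpha)/recall).
    alpha weights precision; alpha=0.5 gives the F1-Measure."""
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Counts from the confusion matrix in this report.
TP, FP, FN = 489, 146, 1
precision = TP / (TP + FP)
recall = TP / (TP + FN)

f1 = f_measure(precision, recall, alpha=0.5)
f1_direct = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
print(f"F (alpha=0.5) = {f1:.3f}, harmonic mean = {f1_direct:.3f}")
```

Setting α=1 returns the precision alone and α=0 returns the recall alone, which is a quick way to see what "weight of precision" means here.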
Utility is a weighted sum of true and false positives. A false positive has utility -1, and a true positive has utility ur, which defaults to N/P (so that returning every article yields a utility of zero).
Hence U = (ur * TP - FP)/Umax, where Umax = ur * P is the maximum achievable utility. When ur takes its default value N/P, this reduces to U = (TP/P) - (FP/N).
| Utility (ur=102.04) | 0.995 (0.976 to 0.998) |
| Maximum possible utility | 0.996 |
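
The utility definition above can be checked against the reported value using the confusion matrix counts. The function below is a sketch with illustrative names; the default for `ur` follows the N/P convention stated in the text.

```python
def utility(tp, fp, p, n, ur=None):
    """Normalised utility U = (ur*TP - FP) / (ur*P).
    ur is the utility of a true positive; it defaults to N/P."""
    if ur is None:
        ur = n / p
    u_max = ur * p  # utility of retrieving every relevant article and nothing else
    return (ur * tp - fp) / u_max


P, N = 490, 50000
u = utility(tp=489, fp=146, p=P, n=N)
u_simplified = 489 / P - 146 / N  # the TP/P - FP/N form

print(f"ur = {N / P:.2f}")
print(f"U = {u:.3f} (simplified form: {u_simplified:.3f})")
```

Calling `utility(tp=P, fp=N, p=P, n=N)` models returning every article and gives zero up to floating-point rounding, which is the rationale for the default ur.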

| Measure | Value | Description |
|---|---|---|
| Prevalence in cross validation P/(P+N) | 0.00970 | Proportion of training data which was positive. |
| False Positive Rate (FPR) FPR=FP/(TN+FP)=1-TNR | 0.00292 (0.00200 to 0.00380) | Proportion of negatives which were incorrectly predicted to be positive. |
| Specificity (TNR) TNR=TN/(TN+FP)=1-FPR | 0.99708 (0.99620 to 0.99800) | Proportion of negatives which were correctly predicted to be negative. |
| Error Rate (FP+FN)/(P+N)=1-Accuracy | 0.00291 (0.00198 to 0.00376) | |
| Enrichment (= precision/prevalence) | 79.350 (74.250 to 85.576) | Precision over prevalence: how much better this classifier's precision is than that of a classifier which calls everything positive. |

| Quantity | Relevant Docs | Irrelevant Docs |
|---|---|---|
| Number of documents | 490 | 50000 |
| Number of selected, occurring features | 9948 | 36876 |
| Total occurrences of selected features | 58210 | 2481684 |
| Selected features per Medline record | 118.796 | 49.634 |

Of the considered feature types, 39623 features are selected out of 281290 occurring at least once in training data. The aggressivity of selection is 7.099. The complete database lists 3703762 potential features.
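
The report does not define "aggressivity" or state how the per-record averages are obtained. The sketch below assumes that aggressivity is the ratio of occurring to selected features and that "selected features per Medline record" is total occurrences divided by the number of documents; both assumptions reproduce the figures in the table above.

```python
# Figures copied from the feature selection table and note above.
selected_features = 39623
occurring_features = 281290

docs = {"relevant": 490, "irrelevant": 50000}
occurrences = {"relevant": 58210, "irrelevant": 2481684}

# Assumption: aggressivity = occurring features / selected features.
aggressivity = occurring_features / selected_features

# Assumption: per-record average = total occurrences / number of documents.
per_record = {label: occurrences[label] / docs[label] for label in docs}

print(f"aggressivity = {aggressivity:.3f}")
for label, value in per_record.items():
    print(f"selected features per {label} record = {value:.3f}")
```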
Normalised histograms (sum of bar areas normalised to 1), approximating probability distributions for relevant and irrelevant article scores. Good performance is associated with clean separation of the distributions.
Normalised histogram approximating the probability distribution for feature scores (after training on all available data).
True Positive Rate versus False Positive Rate. The closer to the top left the curve gets, the better. Worst case is a diagonal line (true positives increasing at the same rate as false positives).
Precision as a function of Recall. The recall corresponding to the chosen threshold is marked with a vertical line. Worst case is a horizontal line at the level of prevalence.
Precision, Recall and F-measure as a function of threshold. The chosen threshold is marked with a vertical line.