Evaluation
Generated Ontology Evaluation¶
We evaluate MASEO framework across two complementary dimensions in three ontology generation case studies: Infrastructure Ontology, Vehicle Census Ontology (VCO), and Video Game Ontology (VGO). 1. Structural characteristics & CQ coverage: The first dimension combines structural analysis and CQ requirement coverage. Structural characteristics are derived from ontology diagrams, while CQ coverage is assessed from provenance records through expert inspection. 2. Concept label matching & concept coverage: The second dimension evaluates the alignment between concept labels in the generated ontologies and those in the corresponding gold-standard ontologies. We assess this alignment using three matching strategies, namely exact match, lexical match, and semantic match, and report precision, recall, F1-score, and concept coverage for each strategy.
Evaluation Datasets Selection¶
| Dataset | Language | CQs | Gold Standard |
|---|---|---|---|
| Infrastructure Ontology | Spanish | 5 | Gold_Infrastructure.owl |
| Vehicle Census Ontology (VCO) | Spanish | 28 | Gold_VCO.owl |
| Video Game Ontology (VGO) | English | 68 | Gold_VGO.owl |
Evaluation Perspectives¶
Structural Analysis¶
To compare the structural characteristics (e.g., number of classes, object properties, datatype properties, and hierarchy structure) for ontologies, visualization of the ontology is adopted to conduct the analysis. In this project, we have adopted owl2diagram to generate the diagrams.
CQ Coverage¶
The proportion of input CQs that can be traced to at least one ontology element through provenance records. Therefore, we adopted CQ coverage to evaluate how many CQs are actually used in the ontology. Here is the calculation process of the CQ Coverage.
CQCoverage = \frac{|Q_{covered}|}{|Q_{input}|}
Concept Label Matching¶
Concept label matching evaluates whether a concept label in the generated ontology can be aligned with a concept label in the corresponding gold-standard ontology.
To assess concept label alignment, we adopt three matching strategies:
- Exact match: labels are considered matched only when they are character-for-character identical.
- Lexical match: labels are matched based on character-level similarity using
SequenceMatcherfrom thediffliblibrary. - Semantic match: labels are matched based on embedding cosine similarity using
embeddinggemma, hosted locally via the Ollama runtime environment.
| Strategy | Method | Tool |
|---|---|---|
| Exact | String equality | — |
| Lexical | Character-level similarity | difflib.SequenceMatcher |
| Semantic | Embedding cosine similarity | embeddinggemma via Ollama |
The calculation process of Precision, recall, and F1-score is given here:
P=\frac{TP}{TP+FP}, \quad
R=\frac{TP}{TP+FN}, \quad
F1=\frac{2PR}{P+R}
Concept Coverage¶
To further evaluation for concept label mathcing, we adopted Concept Coverage to measure how many concepts in gold-standard ontology are matched.
Here is the calculation process for Concept Coverage
ConceptCoverage^m = \frac{|C^m_{match}|}{|C_{gold}|}, \quad m \in \{exact, lex, sem\}
C_goldis the set of gold-standard conceptsC_match^mis the set of concepts matched under strategym
Structural Analysis¶
Here is the command to generate the diagram for the generated/gold standard ontology:
python -m owl2diagram \
dataset/gold_standard_ontology/Gold_VCO.owl \
gold_VCO.md
python -m owl2diagram \
dataset/generated_ontology/Gen_VCO.owl \
gen_VCO.md
Concept label matching¶
Here is the command to evaluate the generated ontology to the gold standard ontology. generate_onto_file_path refers to the local path to the generated ontology, ground_onto_file_path refers to the local path to the gold standard ontology.
cd evaluation
python eva_.py \
--generate_onto_file_path ../dataset/generated_ontology/Gen_VCO.owl \
--ground_onto_file_path ../dataset/gold_standard_ontology/Gold_VCO.owl
Evaluation Result¶
Structural Analysis¶
Here is an example of the result of structural analysis:
Vehicle Census Ontology (VCO) Generated Ontology:

Gold Standard Ontology:

The full structural analysis of three ontologies:
| Element | Infrastructure (Gold / Gen) | VCO (Gold / Gen) | VGO (Gold / Gen) |
|---|---|---|---|
| Classes | 37 / 14 | 7 / 15 | 37 / 13 |
| Object Properties | 13 / 12 | 4 / 18 | 32 / 33 |
| Datatype Properties | 0 / 5 | 4 / 2 | 6 / 9 |
| Subclass Relations | 15 / 7 | 1 / 4 | 24 / 0 |
| InverseOf Axioms | 0 / 6 | 0 / 9 | 0 / 15 |
| Linked CQs | 5 / 5 | 28 / 17 | 68 / 37 |
| CQ Coverage | 100.0% | 60.7% | 54.4% |
CQ Coverage¶
| Dataset | Input CQs | Covered CQs | Coverage |
|---|---|---|---|
| Infrastructure | 5 | 5 | 100.0% |
| Vehicle Census (VCO) | 28 | 17 | 60.7% |
| Video Game (VGO) | 68 | 37 | 54.4% |
Class Counts Used for Concept Label Matching¶
| Dataset | Generated Classes | Gold-standard Classes |
|---|---|---|
| Infrastructure | 14 | 40 |
| Vehicle Census (VCO) | 15 | 10 |
| Video Game (VGO) | 13 | 37 |
Concept Label Matching Results¶
| Dataset | Strategy | Precision | Recall | F1-score | Coverage |
|---|---|---|---|---|---|
| Infrastructure | Exact | 0.071 | 0.024 | 0.037 | 0.025 |
| Infrastructure | Lexical | 0.750 | 0.500 | 0.600 | 0.500 |
| Infrastructure | Semantic | 0.667 | 0.750 | 0.705 | 0.605 |
| Vehicle Census (VCO) | Exact | 0.333 | 0.333 | 0.333 | 0.400 |
| Vehicle Census (VCO) | Lexical | 0.733 | 0.643 | 0.685 | 0.700 |
| Vehicle Census (VCO) | Semantic | 0.708 | 0.846 | 0.772 | 0.846 |
| Video Game (VGO) | Exact | 0.250 | 0.167 | 0.200 | 0.135 |
| Video Game (VGO) | Lexical | 0.690 | 0.735 | 0.712 | 0.676 |
| Video Game (VGO) | Semantic | 0.924 | 0.712 | 0.804 | 0.712 |