Evaluation

Generated Ontology Evaluation

We evaluate the MASEO framework along two complementary dimensions in three ontology generation case studies: Infrastructure Ontology, Vehicle Census Ontology (VCO), and Video Game Ontology (VGO).

1. Structural characteristics & CQ coverage: The first dimension combines structural analysis with competency question (CQ) requirement coverage. Structural characteristics are derived from ontology diagrams, while CQ coverage is assessed from provenance records through expert inspection.
2. Concept label matching & concept coverage: The second dimension evaluates the alignment between concept labels in the generated ontologies and those in the corresponding gold-standard ontologies. We assess this alignment using three matching strategies (exact match, lexical match, and semantic match) and report precision, recall, F1-score, and concept coverage for each strategy.

Evaluation Datasets Selection

Dataset	Language	CQs	Gold Standard
Infrastructure Ontology	Spanish	5	Gold_Infrastructure.owl
Vehicle Census Ontology (VCO)	Spanish	28	Gold_VCO.owl
Video Game Ontology (VGO)	English	68	Gold_VGO.owl

Evaluation Perspectives

Structural Analysis

To compare the structural characteristics of the ontologies (e.g., number of classes, object properties, datatype properties, and hierarchy structure), we visualize each ontology as a diagram and analyze the result. In this project, the diagrams are generated with owl2diagram.

CQ Coverage

CQ coverage is the proportion of input CQs that can be traced to at least one ontology element through provenance records. It indicates how many of the input CQs are actually reflected in the generated ontology. With Q_{input} the set of input CQs and Q_{covered} the subset traced to at least one ontology element, it is computed as:

CQCoverage = \frac{|Q_{covered}|}{|Q_{input}|}
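
As a minimal sketch of this ratio, assume the provenance records have been loaded into a mapping from CQ identifier to the list of ontology elements traced to it. The mapping shape, the function name `cq_coverage`, and the sample identifiers are illustrative assumptions, not the project's actual data format.

```python
def cq_coverage(provenance: dict[str, list[str]]) -> float:
    """Fraction of input CQs traced to at least one ontology element."""
    if not provenance:
        return 0.0
    # A CQ counts as covered when its list of linked elements is non-empty.
    covered = sum(1 for elements in provenance.values() if elements)
    return covered / len(provenance)


# Hypothetical provenance: CQ id -> ontology elements linked to it.
example = {
    "CQ1": ["Vehicle", "hasOwner"],
    "CQ2": [],
    "CQ3": ["Municipality"],
    "CQ4": ["registrationDate"],
}
print(f"{cq_coverage(example):.1%}")  # 3 of 4 CQs covered -> 75.0%
```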

Concept Label Matching

Concept label matching evaluates whether a concept label in the generated ontology can be aligned with a concept label in the corresponding gold-standard ontology.

To assess concept label alignment, we adopt three matching strategies:

Exact match: labels are considered matched only when they are character-for-character identical.
Lexical match: labels are matched based on character-level similarity using SequenceMatcher from the difflib library.
Semantic match: labels are matched based on embedding cosine similarity using embeddinggemma, hosted locally via the Ollama runtime environment.

Strategy	Method	Tool
Exact	String equality	—
Lexical	Character-level similarity	`difflib.SequenceMatcher`
Semantic	Embedding cosine similarity	`embeddinggemma` via Ollama
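
The three strategies can be sketched with the standard library alone. This is an illustration under stated assumptions: no label normalization is applied, the 0.8 threshold is hypothetical, and the toy vectors stand in for embeddings that would in practice come from `embeddinggemma`. The function names are not taken from the project's code.

```python
import math
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Exact strategy: labels must be character-for-character identical."""
    return a == b


def lexical_similarity(a: str, b: str) -> float:
    """Lexical strategy: character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Semantic strategy: cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


THRESHOLD = 0.8  # hypothetical cut-off, not taken from the project

print(exact_match("Vehicle", "vehicle"))
print(lexical_similarity("Vehicle", "Vehicles") >= THRESHOLD)
# Toy vectors standing in for embeddinggemma embeddings.
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 0.5]) >= THRESHOLD)
```

Exact matching is case-sensitive here, which is one reason the lexical and semantic strategies typically accept pairs that exact matching rejects.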

Precision, recall, and F1-score are computed as:

P=\frac{TP}{TP+FP}, \quad
R=\frac{TP}{TP+FN}, \quad
F1=\frac{2PR}{P+R}
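
The formulas translate directly into a small helper. The sketch below uses illustrative counts only; how TP, FP, and FN are counted from matched label pairs is left to the evaluation script and is not assumed here.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from raw counts, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts, not taken from the reported experiments.
p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=3)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.750 R=0.500 F1=0.600
```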

Concept Coverage

To complement concept label matching, we use Concept Coverage to measure the proportion of concepts in the gold-standard ontology that are matched. It is computed as:

ConceptCoverage^m = \frac{|C^m_{match}|}{|C_{gold}|}, \quad m \in \{exact, lex, sem\}

where C_{gold} is the set of gold-standard concepts and C^m_{match} is the set of gold-standard concepts matched under strategy m.
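
A minimal sketch of this computation, assuming the matched gold-standard concepts for each strategy are already available as sets of labels. The labels and the name `concept_coverage` are hypothetical.

```python
def concept_coverage(matched: set[str], gold: set[str]) -> float:
    """Share of gold-standard concepts that were matched under one strategy."""
    if not gold:
        return 0.0
    # Intersect with gold so stray labels outside the gold set are ignored.
    return len(matched & gold) / len(gold)


# Hypothetical gold-standard concepts and per-strategy matches.
gold = {"Vehicle", "Owner", "Municipality", "Brand"}
matched_by_strategy = {
    "exact": {"Vehicle"},
    "lex": {"Vehicle", "Owner"},
    "sem": {"Vehicle", "Owner", "Brand"},
}
for strategy, matched in matched_by_strategy.items():
    print(strategy, f"{concept_coverage(matched, gold):.3f}")
```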

Structural Analysis

The following commands generate the diagrams for the gold-standard and generated ontologies:

python -m owl2diagram \
    dataset/gold_standard_ontology/Gold_VCO.owl \
    gold_VCO.md

python -m owl2diagram \
    dataset/generated_ontology/Gen_VCO.owl \
    gen_VCO.md

Concept Label Matching

The following command evaluates the generated ontology against the gold-standard ontology. `generate_onto_file_path` is the local path to the generated ontology, and `ground_onto_file_path` is the local path to the gold-standard ontology.

cd evaluation
python eva_.py \
    --generate_onto_file_path ../dataset/generated_ontology/Gen_VCO.owl \
    --ground_onto_file_path   ../dataset/gold_standard_ontology/Gold_VCO.owl

Evaluation Result

Structural Analysis

Here is an example of the result of structural analysis:

Vehicle Census Ontology (VCO), generated ontology:

[Figure: diagram of the generated VCO ontology]

Gold-standard ontology:

[Figure: diagram of the gold-standard VCO ontology]

The full structural analysis of the three ontologies:

Element	Infrastructure (Gold / Gen)	VCO (Gold / Gen)	VGO (Gold / Gen)
Classes	37 / 14	7 / 15	37 / 13
Object Properties	13 / 12	4 / 18	32 / 33
Datatype Properties	0 / 5	4 / 2	6 / 9
Subclass Relations	15 / 7	1 / 4	24 / 0
InverseOf Axioms	0 / 6	0 / 9	0 / 15
Linked CQs	5 / 5	28 / 17	68 / 37
CQ Coverage	100.0%	60.7%	54.4%

CQ Coverage

Dataset	Input CQs	Covered CQs	Coverage
Infrastructure	5	5	100.0%
Vehicle Census (VCO)	28	17	60.7%
Video Game (VGO)	68	37	54.4%

Class Counts Used for Concept Label Matching

Dataset	Generated Classes	Gold-standard Classes
Infrastructure	14	40
Vehicle Census (VCO)	15	10
Video Game (VGO)	13	37

Concept Label Matching Results

Dataset	Strategy	Precision	Recall	F1-score	Coverage
Infrastructure	Exact	0.071	0.024	0.037	0.025
Infrastructure	Lexical	0.750	0.500	0.600	0.500
Infrastructure	Semantic	0.667	0.750	0.705	0.605
Vehicle Census (VCO)	Exact	0.333	0.333	0.333	0.400
Vehicle Census (VCO)	Lexical	0.733	0.643	0.685	0.700
Vehicle Census (VCO)	Semantic	0.708	0.846	0.772	0.846
Video Game (VGO)	Exact	0.250	0.167	0.200	0.135
Video Game (VGO)	Lexical	0.690	0.735	0.712	0.676
Video Game (VGO)	Semantic	0.924	0.712	0.804	0.712