OHScore Module

Property prediction metrics and molecular evaluation functions for generated molecules.

Overview

The OHScore module provides metrics for evaluating generated molecules, including:

  • Validity checking
  • Uniqueness assessment
  • Novelty detection against training data
  • Fréchet ChemNet Distance (FCD)
  • Quality filters for drug-likeness

These metrics are essential for evaluating molecular generation models like OHVAE and optimization results from OHPSO.

Module Structure

OHMind/OHScore/
├── __init__.py          # Package initialization
├── metrics.py           # Evaluation metrics
├── utils.py             # Utility functions
├── alert_collection.csv # Structural alerts
└── rules.json           # Quality filter rules

Architecture

graph TD
    subgraph "Input"
        Generated[Generated SMILES]
        Training[Training SMILES]
    end
    
    subgraph "Preprocessing"
        Generated --> Canonicalize[Canonicalize]
        Canonicalize --> ValidFilter[Valid Filter]
        ValidFilter --> Multisets[Molecule Multisets]
    end
    
    subgraph "Metrics"
        Multisets --> Validity[Validity Check]
        Multisets --> Uniqueness[Uniqueness Check]
        Multisets --> Novelty[Novelty Check]
        Multisets --> FCD[FCD Check]
        Multisets --> Quality[Quality Filters]
        Training --> Novelty
        Training --> FCD
        Training --> Quality
    end
    
    subgraph "Output"
        Validity --> Results[Evaluation Results]
        Uniqueness --> Results
        Novelty --> Results
        FCD --> Results
        Quality --> Results
    end

Key Classes

ValidCheck

Computes the fraction of valid molecules.

from OHMind.OHScore.metrics import ValidCheck

class ValidCheck:
    """
    Computes the fraction of valid molecules.
    
    A molecule is valid if:
    - It can be parsed by RDKit
    - It passes sanitization
    
    Parameters
    ----------
    gen : list[str]
        List of SMILES strings
    n_jobs : int
        Number of threads for calculation
        
    Returns
    -------
    float
        Fraction of valid molecules (0-1)
    """

Example Usage

from OHMind.OHScore.metrics import ValidCheck, get_mol, mapper

# Check validity
generated_smiles = ["CCO", "CCCO", "invalid_smiles", "c1ccccc1"]

# Using mapper for parallel processing
n_jobs = 4
gen = mapper(n_jobs)(get_mol, generated_smiles)
validity = 1 - gen.count(None) / len(gen)

print(f"Validity: {validity:.2%}")  # Validity: 75.00%
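
The arithmetic in the example above can be wrapped in a small helper that also guards against an empty input. This is a minimal sketch: `validity_fraction` is a hypothetical name, not part of OHScore, and it assumes that failed parses are represented as `None`, as described for `get_mol`. Plain strings stand in for RDKit molecules so the snippet runs without any dependencies.

```python
from typing import Optional, Sequence


def validity_fraction(mols: Sequence[Optional[object]]) -> float:
    """Return the fraction of entries that are not None (hypothetical helper)."""
    if not mols:
        # Nothing was generated: report 0.0 instead of dividing by zero.
        return 0.0
    n_invalid = sum(1 for mol in mols if mol is None)
    return 1 - n_invalid / len(mols)


# Stand-in parse results: three parsed molecules and one failure.
parsed = ["mol_a", "mol_b", None, "mol_c"]
print(f"Validity: {validity_fraction(parsed):.2%}")  # Validity: 75.00%
```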

UniquenessCheck

Checks for unique molecules in the generated set.

from OHMind.OHScore.metrics import UniquenessCheck

class UniquenessCheck:
    """
    Computes uniqueness of generated molecules.
    
    A molecule bag is unique if it contains at least one molecule
    that has not been generated so far.
    
    Returns
    -------
    float
        Fraction of unique molecule bags (0-1)
    """
    
    def __call__(self, valid_molecule_bags):
        """
        Parameters
        ----------
        valid_molecule_bags : list[multiset.BaseMultiset]
            List of molecule multisets
            
        Returns
        -------
        float
            Uniqueness score
        """

Example Usage

from OHMind.OHScore.metrics import UniquenessCheck, filter_valid_and_map_to_ms

# Generate some molecules
generated_smiles = [
    "CCO",
    "CCO",  # Duplicate
    "CCCO",
    "c1ccccc1",
    "c1ccccc1",  # Duplicate
]

# Convert to valid multisets
valid_ms = filter_valid_and_map_to_ms(generated_smiles)

# Check uniqueness
uniqueness_checker = UniquenessCheck()
uniqueness = uniqueness_checker(valid_ms)

print(f"Uniqueness: {uniqueness:.2%}")
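
To make the definition in the docstring concrete, the following sketch counts a bag as unique when it contains at least one molecule not seen in any earlier bag. It is one possible reading of that definition, not the OHScore implementation: `uniqueness_sketch` is a hypothetical name, and plain `frozenset` objects of canonical SMILES strings stand in for the multisets.

```python
from typing import FrozenSet, Iterable, Set


def uniqueness_sketch(bags: Iterable[FrozenSet[str]]) -> float:
    """Fraction of bags that contain at least one not-yet-seen molecule."""
    seen: Set[str] = set()
    n_bags = 0
    n_unique = 0
    for bag in bags:
        n_bags += 1
        # A bag counts as unique if any of its molecules is new.
        if any(smiles not in seen for smiles in bag):
            n_unique += 1
        seen.update(bag)
    return n_unique / n_bags if n_bags else 0.0


# Same molecules as the example above, one molecule per bag.
bags = [
    frozenset({"CCO"}),
    frozenset({"CCO"}),       # duplicate
    frozenset({"CCCO"}),
    frozenset({"c1ccccc1"}),
    frozenset({"c1ccccc1"}),  # duplicate
]
print(f"Uniqueness: {uniqueness_sketch(bags):.2%}")  # Uniqueness: 60.00%
```

Under this reading, three of the five bags introduce a new molecule, which gives 60%.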

NoveltyCheck

Checks for novel molecules, i.e. molecules that do not appear in the training data.

from OHMind.OHScore.metrics import NoveltyCheck

class NoveltyCheck:
    """
    Computes novelty against training dataset.
    
    A molecule bag is novel if at least one molecule does not
    appear in the training dataset.
    
    Parameters
    ----------
    training_canonical_smiles : list[str]
        List of canonical SMILES from training data
    """
    
    def __init__(self, training_canonical_smiles):
        self.training_smiles = set(training_canonical_smiles)
    
    def __call__(self, valid_molecule_bags):
        """
        Parameters
        ----------
        valid_molecule_bags : list[multiset.BaseMultiset]
            List of molecule multisets
            
        Returns
        -------
        float
            Novelty score (0-1)
        """

Example Usage

from OHMind.OHScore.metrics import NoveltyCheck, filter_valid_and_map_to_ms

# Training data
training_smiles = ["CCO", "CCCO", "c1ccccc1", "CC(=O)O"]

# Generated molecules
generated_smiles = [
    "CCO",      # In training
    "CCCCO",    # Novel
    "c1ccc(O)cc1",  # Novel
]

# Convert to valid multisets
valid_ms = filter_valid_and_map_to_ms(generated_smiles)

# Check novelty
novelty_checker = NoveltyCheck(training_smiles)
novelty = novelty_checker(valid_ms)

print(f"Novelty: {novelty:.2%}")
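
The novelty definition can be illustrated the same way. The sketch below is an assumption-laden stand-in, not the OHScore code: `novelty_sketch` is a hypothetical name, bags are plain `frozenset` objects, and SMILES strings are compared as written, so both sides must already be in the same canonical form.

```python
from typing import FrozenSet, Iterable


def novelty_sketch(bags: Iterable[FrozenSet[str]], training: Iterable[str]) -> float:
    """Fraction of bags with at least one molecule absent from the training set."""
    training_set = set(training)
    n_bags = 0
    n_novel = 0
    for bag in bags:
        n_bags += 1
        # A bag counts as novel if any of its molecules is not in training.
        if any(smiles not in training_set for smiles in bag):
            n_novel += 1
    return n_novel / n_bags if n_bags else 0.0


training = ["CCO", "CCCO", "c1ccccc1", "CC(=O)O"]
bags = [
    frozenset({"CCO"}),          # in training
    frozenset({"CCCCO"}),        # novel
    frozenset({"c1ccc(O)cc1"}),  # novel
]
print(f"Novelty: {novelty_sketch(bags, training):.2%}")  # Novelty: 66.67%
```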

FCDCheck

Computes Fréchet ChemNet Distance.

from OHMind.OHScore.metrics import FCDCheck

class FCDCheck:
    """
    Computes Fréchet ChemNet Distance between training and generated molecules.
    
    FCD measures the similarity of molecular distributions using
    activations from a pretrained ChemNet model.
    
    Lower FCD indicates more similar distributions.
    
    Parameters
    ----------
    training_smi : list[str]
        List of training SMILES
    sample_size : int
        Number of samples for estimation (default: 10000)
        
    Notes
    -----
    - FCD is estimated from samples; variance may be high for small sets
    - Returns raw FCD, not the GuacaMol "FCD Score" (exp(-0.2*FCD))
    """
    
    def __init__(self, training_smi, sample_size=10000):
        pass
    
    def __call__(self, valid_molecule_bags):
        """
        Parameters
        ----------
        valid_molecule_bags : list[multiset.BaseMultiset]
            List of molecule multisets
            
        Returns
        -------
        float
            FCD value (lower is better)
        """

Example Usage

from OHMind.OHScore.metrics import FCDCheck, filter_valid_and_map_to_ms

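# Training data (should be large for accurate FCD).
# load_training_smiles and load_generated_smiles are not imported above;
# they stand for user-supplied loaders that return a list of SMILES strings.
training_smiles = load_training_smiles("training_data.smi")

# Generated molecules
generated_smiles = load_generated_smiles("generated.smi")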

# Convert to valid multisets
valid_ms = filter_valid_and_map_to_ms(generated_smiles)

# Compute FCD
fcd_checker = FCDCheck(training_smiles, sample_size=10000)
fcd_score = fcd_checker(valid_ms)

print(f"FCD: {fcd_score:.4f}")
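
Because FCDCheck returns the raw distance, comparing against results reported as a GuacaMol "FCD Score" requires the conversion exp(-0.2*FCD) mentioned in the docstring notes. A minimal sketch of that conversion follows; `fcd_to_guacamol_score` is a hypothetical helper name, not an OHScore function.

```python
import math


def fcd_to_guacamol_score(fcd: float) -> float:
    """Map a raw FCD value to the GuacaMol-style score exp(-0.2 * FCD)."""
    return math.exp(-0.2 * fcd)


# A raw FCD of 0 maps to a score of 1; larger distances decay towards 0.
for raw_fcd in (0.0, 1.0, 5.0):
    print(f"FCD {raw_fcd:.1f} -> score {fcd_to_guacamol_score(raw_fcd):.4f}")
```

Note the reversed direction: lower is better for the raw FCD, while higher is better for the converted score.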

QualityFiltersCheck

Applies quality filters for drug-likeness.

from OHMind.OHScore.metrics import QualityFiltersCheck

class QualityFiltersCheck:
    """
    Quality filters from GuacaMol paper.
    
    Filters out compounds that are:
    - Potentially unstable
    - Reactive
    - Laborious to synthesize
    - Unpleasant to medicinal chemists
    
    Uses rules from GuacaMol supplementary material and
    rd_filters package.
    
    Parameters
    ----------
    training_data_smi : list[str]
        Training SMILES for normalization
        
    Filter Rules
    ------------
    - Structural alerts (PAINS, etc.)
    - Molecular weight range
    - LogP range
    - HBD/HBA counts
    - TPSA range
    """
    
    def __init__(self, training_data_smi):
        pass
    
    def __call__(self, valid_molecule_bags):
        """
        Parameters
        ----------
        valid_molecule_bags : list[multiset.BaseMultiset]
            List of molecule multisets
            
        Returns
        -------
        float
            Fraction passing filters (normalized to training data)
        """

Example Usage

from OHMind.OHScore.metrics import QualityFiltersCheck, filter_valid_and_map_to_ms

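# Training data.
# load_training_smiles and load_generated_smiles are not imported above;
# they stand for user-supplied loaders that return a list of SMILES strings.
training_smiles = load_training_smiles("training_data.smi")

# Generated molecules
generated_smiles = load_generated_smiles("generated.smi")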

# Convert to valid multisets
valid_ms = filter_valid_and_map_to_ms(generated_smiles)

# Apply quality filters
quality_checker = QualityFiltersCheck(training_smiles)
quality_score = quality_checker(valid_ms)

print(f"Quality Filter Score: {quality_score:.2%}")
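
The docstring describes the result as "normalized to training data" without giving the formula. One plausible reading, sketched below purely as an assumption, is the pass rate of the generated set divided by the pass rate of the training set. The names `pass_rate`, `normalized_quality`, and `toy_filter` are hypothetical, and the toy predicate only stands in for the real structural-alert rules.

```python
from typing import Callable, Sequence


def pass_rate(smiles: Sequence[str], passes: Callable[[str], bool]) -> float:
    """Fraction of SMILES strings accepted by the predicate."""
    if not smiles:
        return 0.0
    return sum(1 for s in smiles if passes(s)) / len(smiles)


def normalized_quality(
    generated: Sequence[str],
    training: Sequence[str],
    passes: Callable[[str], bool],
) -> float:
    """Generated pass rate divided by training pass rate (assumed normalization)."""
    train_rate = pass_rate(training, passes)
    if train_rate == 0.0:
        raise ValueError("no training molecule passes the filter")
    return pass_rate(generated, passes) / train_rate


def toy_filter(smiles: str) -> bool:
    """Illustrative predicate only: reject anything containing bromine."""
    return "Br" not in smiles


training = ["CCO", "CCBr", "CCC", "CCN"]  # 3 of 4 pass
generated = ["CCO", "CBr"]                # 1 of 2 pass
print(f"Normalized quality: {normalized_quality(generated, training, toy_filter):.4f}")
```

If the actual normalization differs, only `normalized_quality` would need to change; check `metrics.py` for the definitive behavior.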

Usage Examples

Complete Evaluation Pipeline

from OHMind.OHScore.metrics import (
    filter_valid_and_map_to_ms,
    UniquenessCheck,
    NoveltyCheck,
    FCDCheck,
    QualityFiltersCheck
)

def evaluate_generated_molecules(generated_smiles, training_smiles):
    """
    Comprehensive evaluation of generated molecules.
    
    Parameters
    ----------
    generated_smiles : list[str]
        Generated SMILES strings
    training_smiles : list[str]
        Training SMILES strings
        
    Returns
    -------
    dict
        Dictionary of metric scores
    """
    results = {}
    
    # Convert to valid multisets
    valid_ms = filter_valid_and_map_to_ms(generated_smiles)
    
    # Validity (0.0 when no molecules were generated, avoiding division by zero)
    results['validity'] = len(valid_ms) / len(generated_smiles) if generated_smiles else 0.0
    
    # Uniqueness
    uniqueness_checker = UniquenessCheck()
    results['uniqueness'] = uniqueness_checker(valid_ms)
    
    # Novelty
    novelty_checker = NoveltyCheck(training_smiles)
    results['novelty'] = novelty_checker(valid_ms)
    
    # FCD (if enough samples)
    if len(valid_ms) >= 1000:
        fcd_checker = FCDCheck(training_smiles, sample_size=min(10000, len(valid_ms)))
        results['fcd'] = fcd_checker(valid_ms)
    else:
        results['fcd'] = None
    
    return results

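# Example usage
# load_smiles is not defined in this snippet; it stands for any user-supplied
# function that reads a file and returns a list of SMILES strings.
training_smiles = load_smiles("training.smi")
generated_smiles = load_smiles("generated.smi")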

metrics = evaluate_generated_molecules(generated_smiles, training_smiles)

print("Evaluation Results:")
print(f"  Validity:   {metrics['validity']:.2%}")
print(f"  Uniqueness: {metrics['uniqueness']:.2%}")
print(f"  Novelty:    {metrics['novelty']:.2%}")
if metrics['fcd'] is not None:
    print(f"  FCD:        {metrics['fcd']:.4f}")
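
The example calls `load_smiles` without importing it. A minimal sketch of such a loader is shown here, assuming a `.smi` file with one record per line whose first whitespace-separated field is the SMILES string. The function is hypothetical and not part of OHScore; adapt it to the actual file layout.

```python
import os
import tempfile
from typing import List


def load_smiles(path: str) -> List[str]:
    """Read the first whitespace-separated field of every non-blank line."""
    smiles = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                smiles.append(fields[0])
    return smiles


# Demonstration with a throwaway file containing two records and a blank line.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.smi")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("CCO ethanol\n\nc1ccccc1 benzene\n")
    print(load_smiles(demo_path))  # ['CCO', 'c1ccccc1']
```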

Evaluating OHPSO Results

from OHMind.OHScore.metrics import (
    filter_valid_and_map_to_ms,
    NoveltyCheck,
    UniquenessCheck
)
from OHMind.OHPSO.optimizer import BasePSOptimizer

# After PSO optimization
optimizer = BasePSOptimizer.from_query(...)
swarms = optimizer.run(num_steps=10, num_track=50)

# Get generated SMILES
generated_smiles = optimizer.best_solutions['smiles'].tolist()

# Load training data (load_smiles is a user-supplied loader, not imported above)
training_smiles = load_smiles("cation_training.smi")

# Evaluate
valid_ms = filter_valid_and_map_to_ms(generated_smiles)

novelty_checker = NoveltyCheck(training_smiles)
uniqueness_checker = UniquenessCheck()

print(f"Generated: {len(generated_smiles)}")
print(f"Valid: {len(valid_ms)}")
print(f"Novelty: {novelty_checker(valid_ms):.2%}")
print(f"Uniqueness: {uniqueness_checker(valid_ms):.2%}")

Batch Evaluation with Config

from OHMind.OHScore.metrics import Params, evaluating

# Create params with config
params = Params(
    processed_data_dir="data/",
    experiments_config="evaluation_config.json"
)

# Run evaluation
evaluating(params)

Example config file (evaluation_config.json):

{
    "data_dir": "results/",
    "table_format": "github",
    "tables_to_create": [
        {
            "metrics": ["validity", "uniqueness", "novelty", "FCD"],
            "rows": {
                "OHVAE": ["ohvae_generated.smi", 10000],
                "OHPSO": ["ohpso_optimized.smi", 1000],
                "Baseline": ["baseline_generated.smi", 10000]
            }
        }
    ]
}
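
Before running a batch evaluation it can help to check what the config will request. The sketch below parses the example config with the standard `json` module and lists each row. Two points are assumptions rather than documented behavior: that file names are resolved relative to `data_dir`, and that the second element of each row is a molecule count. `iter_rows` is a hypothetical helper.

```python
import json
import posixpath

CONFIG_TEXT = """
{
    "data_dir": "results/",
    "table_format": "github",
    "tables_to_create": [
        {
            "metrics": ["validity", "uniqueness", "novelty", "FCD"],
            "rows": {
                "OHVAE": ["ohvae_generated.smi", 10000],
                "OHPSO": ["ohpso_optimized.smi", 1000],
                "Baseline": ["baseline_generated.smi", 10000]
            }
        }
    ]
}
"""


def iter_rows(config):
    """Yield (label, path, count, metrics) for every row of every table."""
    for table in config["tables_to_create"]:
        for label, (filename, count) in table["rows"].items():
            # Assumption: file names are relative to data_dir.
            path = posixpath.join(config["data_dir"], filename)
            yield label, path, count, table["metrics"]


config = json.loads(CONFIG_TEXT)
for label, path, count, metrics in iter_rows(config):
    print(f"{label}: {path} ({count} molecules) -> {', '.join(metrics)}")
```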

Evaluation Pipeline

Standard Workflow

graph LR
    A[Raw SMILES] --> B[Canonicalize]
    B --> C[Filter Valid]
    C --> D[Create Multisets]
    D --> E[Compute Metrics]
    E --> F[Report Results]

Parallel Processing

from OHMind.OHScore.metrics import mapper, get_mol

# Use multiple cores
n_jobs = 8
parallel_map = mapper(n_jobs)

# Parallel validity check
mols = parallel_map(get_mol, smiles_list)
validity = 1 - mols.count(None) / len(mols)
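
For readers curious how a factory like `mapper` can switch between serial and parallel execution, here is a simplified sketch. It is not the OHScore implementation: `mapper_sketch` and `toy_parse` are hypothetical names, the sketch accepts only an integer (the documented `mapper` also accepts a Pool), and it uses a thread pool from the standard library so the snippet stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def mapper_sketch(n_jobs: int) -> Callable[[Callable, Iterable], List]:
    """Return a map-like function: serial for n_jobs == 1, threaded otherwise."""
    if n_jobs == 1:
        def serial_map(func, items):
            return list(map(func, items))
        return serial_map

    def threaded_map(func, items):
        # Executor.map preserves input order, so results line up with items.
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            return list(pool.map(func, items))
    return threaded_map


def toy_parse(text: str) -> Optional[str]:
    """Stand-in for get_mol: treat strings containing '_' as unparsable."""
    return None if "_" in text else text


smiles_list = ["CCO", "CCCO", "invalid_smiles", "c1ccccc1"]
mols = mapper_sketch(2)(toy_parse, smiles_list)
validity = 1 - mols.count(None) / len(mols)
print(f"Validity: {validity:.2%}")  # Validity: 75.00%
```

Returning a list, rather than a lazy iterator, is what makes `mols.count(None)` work in both the serial and the parallel branch.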

API Reference

filter_valid_and_map_to_ms

def filter_valid_and_map_to_ms(smiles_in):
    """
    Filter valid SMILES and convert to canonical multisets.
    
    Parameters
    ----------
    smiles_in : list[str]
        Input SMILES strings
        
    Returns
    -------
    list[multiset.FrozenMultiset]
        List of canonical molecule multisets
    """

get_mol

def get_mol(smiles_or_mol):
    """
    Load SMILES/molecule into RDKit object.
    
    Parameters
    ----------
    smiles_or_mol : str or RDKit Mol
        Input SMILES string or molecule
        
    Returns
    -------
    RDKit Mol or None
        Sanitized molecule or None if invalid
    """

mapper

def mapper(n_jobs):
    """
    Returns function for parallel map calls.
    
    Parameters
    ----------
    n_jobs : int or Pool
        Number of jobs (1 for serial) or Pool object
        
    Returns
    -------
    callable
        Map function
    """

Metric Summary

Metric      Range  Ideal  Description
----------  -----  -----  ----------------------------
Validity    0-1    1.0    Fraction of valid molecules
Uniqueness  0-1    1.0    Fraction of unique molecules
Novelty     0-1    1.0    Fraction novel to training
FCD         0-∞    0.0    Distribution distance
Quality     0-1    1.0    Fraction passing filters

Last updated: 2025-12-22 | OHMind v1.0.0


PolyAI Team
Copyright © 2009-2025 Changchun Institute of Applied Chemistry, Chinese Academy of Sciences
Address: No. 5625, Renmin Street, Changchun, Jilin, China. Postal Code: 130022