The science and modeling evidence behind High-Performance Fermentation with Neural Networks + Digital Twin. Two ways to read it: Scientist is the full technical reference — fermentation kinetics, the soft-sensor and digital-twin math, fits, and validation. Othertist tells the same story in plain English for reviewers, investors, and funders who need to get it without operating it.
This is the science and modeling foundation under Project Ai26.10. The Fermentation Data workbook is Crush Dynamics' historical record of sixteen production batches (“totes”) of its patented grape-pomace biotransformation — a controlled, non-sterile fermentation that upgrades winery by-product into polyphenol- and fibre-rich functional food ingredients. Each tab logs four operator-controlled input_ conditions against four measured output_ variables over the fermentation cycle. The headline output, titratable acidity (TA), is the process KPI: the run is complete at TA ≥ 3% (HACCP-validated, pH ≤ 4.4). This document characterises that baseline and builds the predictive soft-sensor / digital-twin models the project is built on.
CDI's process is an aerobic acidogenic fermentation of grape pomace. A microbial community consumes residual sugars and ethanol from the pomace, drawing dissolved oxygen, and progressively acidifies the broth — titratable acidity climbs while pH falls. TA is both the product-quality endpoint and the clock: the cycle ends when TA reaches target. Because the broth is open and adjusted during the run (volume and substrate vary within and between batches), it behaves as an open, fed system rather than a sealed batch — a fact that shapes every model below.
The grape-pomace fermentation is a microbial acidification: a community of acid-forming organisms metabolises the carbon available in the pomace — residual grape sugars plus residual winemaking ethanol — and excretes organic acids. Those acids are what the process is measured by, and what protects the non-sterile product (HACCP envelope: pH ≤ 4.4, TA ≥ 3%). The exact microbial consortium and acid profile are CDI background IP; for modeling, the relevant quantity is the cumulative titratable acidity.
TA is a titration result — the total neutralisable acid in the broth, reported as g/L (acid equivalents). It aggregates every organic acid present rather than naming one, which is why it is a robust, instrument-light progress KPI:
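A standard titration relation consistent with the symbols defined just below can be written as follows; the sample volume \(V_{\text{sample}}\) is an assumed symbol that the source does not name.

\[
\mathrm{TA}\ (\text{g/L}) \;=\; \frac{C_{\text{base}}\,V_{\text{base}}\,E_{\text{acid}}}{V_{\text{sample}}}
\]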
where \(C_{\text{base}}V_{\text{base}}\) is the moles of standard base to reach the endpoint and \(E_{\text{acid}}\) the acid equivalent weight. The soft sensor's job (Project Phase 3) is to estimate TA continuously from fast in-line signals (pH, DO, temperature, airflow, agitation) so operators no longer wait on this offline lab titration.
Carbon for acid and biomass comes from two pools — pomace sugars (tracked indirectly by Brix) and residual ethanol. Generically:
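One generic way to write this lumped conversion is sketched below. The yield and stoichiometric coefficients (\(a\), \(b\), \(c\), \(Y_{A/S}\), \(Y_{X/S}\)) are assumed placeholder symbols, since the consortium and acid profile are not disclosed:

\[
S + a\,\mathrm{O_2} \;\longrightarrow\; Y_{A/S}\,A + Y_{X/S}\,X + b\,\mathrm{CO_2} + c\,\mathrm{H_2O}
\]

Here \(S\) is the lumped carbon substrate (sugars plus ethanol), \(A\) the organic acids measured as TA, and \(X\) the biomass.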
The fermentation is aerobic, so its rate is frequently limited by how fast oxygen can be delivered, not by the microbes' appetite. Two quantities the proposal's digital twin estimates govern this:
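In standard bioprocess notation these are the oxygen transfer rate (OTR) and the oxygen uptake rate (OUR). The forms below are the textbook definitions, offered as a sketch: \(O\) is the dissolved-oxygen concentration and \(X\) the active biomass, while the saturation concentration \(O^{*}\) and the specific uptake rate \(q_{O_2}\) are assumed symbols.

\[
\mathrm{OTR} = k_La\,(O^{*} - O), \qquad \mathrm{OUR} = q_{O_2}\,X
\]

At quasi-steady state OTR ≈ OUR, so the achievable uptake is capped at \(k_La\,O^{*}\).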
Aeration level (OFF / LOW⁻ / LOW / MED, or LPM in the bioreactor/drum vessels) sets \(k_La\) and therefore the oxygen-transfer ceiling on the whole reaction. This is consistent with MED aeration producing the highest mean TA in the data (§5), and it is why the MPC layer is designed to optimise aeration first.
The workbook is organised as one tab per vessel. Tab names are tote identifiers (e.g. 23_982). Two auxiliary tabs — bio_reactor_0_045 and drum_0_045 — track the same tote through two vessel types whose aeration is logged in litres-per-minute (LPM) rather than categorical levels.
| Field | Role | Meaning | Units / domain |
|---|---|---|---|
| input_etoh_percent | input | Ethanol content of tote | % v/v · 0–4.8 |
| input_temperature_c | input | System temperature | °C · 15.6–37.8 |
| input_aeration_level | input | System aeration (categorical) | OFF·LOW⁻·LOW·MED |
| input_reactor_aeration_level | input | Bioreactor aeration | LPM |
| input_drum_aeration_level | input | Drum aeration | LPM |
| input_volume_l | input | Tote working volume | L · 650–975 |
| output_time_day | output | Fermentation day | day · 0–70 |
| output_brix | output | Dissolved solids (refractometric) | °Bx · 1.8–8.8 |
| output_total_acidity_g_per_l | output | Titratable acidity (TA), the process KPI | g/L · 0.15–40.8 |
| output_ph | output | Broth pH | – · 3.42–4.81 |
Each line is one tote's titratable acidity over its run. The common shape — slow start, steep middle, tapering plateau — is the logistic signature modeled in §6.
The raw sheets contain three classes of artefact that must be resolved before any analysis: (1) Excel carry-down formula references like =C6 in the aeration column, (2) inconsistent category spellings ("LOW ", "LOW=", "LOW-"), and (3) day formulas such as =E5+5. The normalisation algorithm:
Canonicalisation maps the observed strings OFF · LOW- · LOW · MED to an ordinal intensity scale OFF=0 < LOW⁻=1 < LOW=2 < MED=3 used as a model feature. Day-0 seed rows often carry only aeration+volume (no measured outputs yet) and are retained as NaN-output anchors.
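The three clean-up steps and the ordinal mapping can be sketched in plain Python as below. This is a minimal illustration, not the project's actual cleaner: the sheet is modelled as a dict of row number to raw cells, the column letters `C` and `E` are assumed to hold aeration and day (inferred from the `=C6` and `=E5+5` examples), and mapping both `LOW-` and `LOW=` onto LOW⁻ is a guess.

```python
import re

# Assumed column layout: letter -> field name (hypothetical).
COLUMNS = {"C": "input_aeration_level", "E": "output_time_day"}

# Assumed spelling map; "LOW-" here stands for the canonical LOW⁻ class.
AERATION_CANON = {"OFF": "OFF", "LOW": "LOW", "LOW-": "LOW-",
                  "LOW=": "LOW-", "MED": "MED"}
AERATION_ORDINAL = {"OFF": 0, "LOW-": 1, "LOW": 2, "MED": 3}

# Matches "=C6" (carry-down) and "=E5+5" (day formula).
FORMULA = re.compile(r"^=([A-Z]+)(\d+)(?:\+(\d+(?:\.\d+)?))?$")

def resolve(sheet, row, field, depth=0):
    """Return the literal value of a cell, following formula references."""
    raw = sheet[row].get(field)
    if not isinstance(raw, str):
        return raw
    match = FORMULA.match(raw.strip())
    if match is None:
        return raw.strip()          # plain text, trailing spaces removed
    if depth > len(sheet):
        raise ValueError("circular cell reference")
    col, ref_row, offset = match.group(1), int(match.group(2)), match.group(3)
    base = resolve(sheet, ref_row, COLUMNS[col], depth + 1)
    return base if offset is None else float(base) + float(offset)

def clean(sheet):
    """Resolve formulas, canonicalise aeration, keep missing TA as NaN."""
    out = []
    for row in sorted(sheet):
        aer = AERATION_CANON[str(resolve(sheet, row, "input_aeration_level")).upper()]
        ta = sheet[row].get("output_total_acidity_g_per_l")
        out.append({
            "day": float(resolve(sheet, row, "output_time_day")),
            "aeration": aer,
            "aer_ord": AERATION_ORDINAL[aer],
            "ta": float("nan") if ta is None else float(ta),
        })
    return out

# Synthetic three-row sheet (not workbook data); row 5 is a day-0 seed row.
sheet = {
    5: {"input_aeration_level": "OFF", "output_time_day": 0},
    6: {"input_aeration_level": "=C5", "output_time_day": "=E5+5",
        "output_total_acidity_g_per_l": 2.1},
    7: {"input_aeration_level": "LOW=", "output_time_day": "=E6+5",
        "output_total_acidity_g_per_l": 6.4},
}
cleaned = clean(sheet)
```

The seed row keeps its aeration and day but carries a NaN acidity, matching the NaN-output anchor convention described above.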
Before modeling, five empirical regularities emerge from the pooled data. They are summarised here and each one constrains the model form in §6–7.
Regressing TA on day within each batch gives a mean slope of 0.57 g/L·day (median 0.47) and a mean linear \(R^2=0.876\). The acidification is steady and well-behaved over a run.
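The per-batch slope and \(R^2\) are ordinary least-squares quantities. A minimal sketch of that calculation for one tote follows; the function name and the sample numbers are illustrative, not taken from the workbook.

```python
def linear_fit(days, ta):
    """Ordinary least squares of TA on day; returns (slope, intercept, r2)."""
    n = len(days)
    mean_t = sum(days) / n
    mean_a = sum(ta) / n
    sxx = sum((t - mean_t) ** 2 for t in days)
    sxy = sum((t - mean_t) * (a - mean_a) for t, a in zip(days, ta))
    slope = sxy / sxx
    intercept = mean_a - slope * mean_t
    ss_res = sum((a - (intercept + slope * t)) ** 2 for t, a in zip(days, ta))
    ss_tot = sum((a - mean_a) ** 2 for a in ta)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Synthetic, exactly linear series (not workbook data).
days = [0, 5, 10, 15, 20, 25]
ta = [1.0, 3.5, 6.0, 8.5, 11.0, 13.5]
slope, intercept, r2 = linear_fit(days, ta)
print(slope, intercept, r2)
```

Applied tote by tote, the slope corresponds to the "Acidity rate" column and `r2` to the "Linear R²" column of the table that follows.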
| Tote | n | Acidity rate (g/L·day) | Linear R² | EtOH rate (%/day) | Run (days) | Final acidity (g/L) |
|---|---|---|---|---|---|---|
| 23_053 | 11 | 1.092 | 0.987 | −0.121 | 27 | 33.5 |
| 23_995 | 17 | 1.032 | 0.947 | −0.105 | 25 | 32.9 |
| 23_978 | 17 | 0.882 | 0.929 | −0.061 | 44 | 40.8 |
| 23_980 | 12 | 0.799 | 0.955 | −0.079 | 34 | 27.7 |
| 23_037 | 15 | 0.784 | 0.899 | −0.057 | 35 | 29.6 |
| 23_982 | 24 | 0.723 | 0.939 | −0.126 | 34 | 31.3 |
| 23_998 | 22 | 0.547 | 0.968 | −0.041 | 51 | 28.2 |
| 23_013 | 12 | 0.486 | 0.793 | −0.107 | 33 | 21.1 |
| 23_014 | 15 | 0.444 | 0.907 | −0.037 | 42 | 22.3 |
| 23_055 | 15 | 0.407 | 0.891 | −0.043 | 42 | 27.7 |
| 23_036 | 20 | 0.374 | 0.952 | +0.012 | 52 | 27.8 |
| 23_986 | 22 | 0.365 | 0.971 | −0.038 | 51 | 19.6 |
| 23_991 | 18 | 0.350 | 0.849 | −0.081 | 43 | 23.9 |
| 23_045 | 27 | 0.315 | 0.822 | −0.025 | 67 | 32.4 |
| 0_045 | 23 | 0.268 | 0.612 | −0.043 | 49 | 25.2 |
| 23_032 | 27 | 0.212 | 0.590 | +0.018 | 70 | 22.5 |
Mean ethanol depletion is only −0.058 %/day — an order of magnitude too slow to explain the acid produced. Ethanol fluctuates up and down (feeding events), confirming fed-batch control.
Pearson correlations on the pooled complete-case data:
| | etoh | temp | vol | day | brix | acidity | pH |
|---|---|---|---|---|---|---|---|
| etoh | 1.00 | −0.14 | 0.33 | −0.41 | 0.18 | −0.58 | 0.08 |
| temp | −0.14 | 1.00 | −0.15 | −0.16 | −0.14 | 0.04 | 0.27 |
| vol | 0.33 | −0.15 | 1.00 | 0.29 | 0.20 | −0.07 | −0.20 |
| day | −0.41 | −0.16 | 0.29 | 1.00 | 0.49 | 0.68 | −0.62 |
| brix | 0.18 | −0.14 | 0.20 | 0.49 | 1.00 | 0.45 | −0.41 |
| acidity | −0.58 | 0.04 | −0.07 | 0.68 | 0.45 | 1.00 | −0.54 |
| pH | 0.08 | 0.27 | −0.20 | −0.62 | −0.41 | −0.54 | 1.00 |
Key reads: day↔acidity = +0.68 (time is the dominant driver); etoh↔acidity = −0.58 (acid accumulates as the ethanol band is consumed and re-fed); pH↔acidity = −0.54 and day↔pH = −0.62 (acid drives pH down, partially buffered).
| Aeration | Mean acidity | Mean pH | n rows |
|---|---|---|---|
| OFF | 14.65 | 4.07 | 82 |
| LOW⁻ | 15.76 | 4.10 | 122 |
| LOW | 14.90 | 3.96 | 53 |
| MED | 20.28 | 3.81 | 51 |
MED aeration lifts mean acidity ~35% above the other classes and pushes pH lowest — consistent with oxygen-transfer-limited kinetics (§2). The effect appears late in runs (MED is typically engaged in the high-acid finishing phase), so it is partly confounded with time.
Across totes, the within-tote acidity rate rises +0.064 g/L·day per °C of mean temperature (r = 0.39, p = 0.13). Directionally Arrhenius-like but not statistically resolved at the tote level — most totes are clustered in the 28–31 °C mesophilic optimum, limiting the temperature range over which to estimate the effect.
The model is built bottom-up from microbial growth and substrate kinetics, then reduced to the regime the data actually occupy.
Let \(X\) be active biomass, \(S\) the carbon substrate (sugars + ethanol), \(O\) dissolved oxygen, and \(A\) the titratable acidity. Microbial growth follows a double-Monod law (dual limitation by carbon and oxygen), with product (acid) inhibition:
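A sketch of that growth law is given below, written so that each factor matches the simplifications tabulated later in this section. \(\mu_{\max}\), \(K_S\) and \(K_O\) are the usual maximum-rate and half-saturation constants; the linear inhibition factor is an assumption taken from the \((1-A/A_{\max})\) term that the reduction keeps.

\[
\frac{dX}{dt} = \mu X, \qquad
\mu = \mu_{\max}\,\frac{S}{K_S+S}\,\frac{O}{K_O+O}\left(1-\frac{A}{A_{\max}}\right)
\]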
Acid (TA) is produced coupled to growth and maintenance (Luedeking–Piret), while oxygen is supplied by aeration-driven transfer (\(k_La\)) and consumed by the culture (OUR):
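In the usual Luedeking–Piret form these balances read as below. The growth-associated and maintenance coefficients \(\alpha\) and \(\beta_m\), the oxygen saturation concentration \(O^{*}\), and the specific uptake rate \(q_{O_2}\) are assumed symbols for this sketch:

\[
\frac{dA}{dt} = \alpha\,\frac{dX}{dt} + \beta_m X, \qquad
\frac{dO}{dt} = k_La\,(O^{*}-O) - \mathrm{OUR}, \qquad
\mathrm{OUR} = q_{O_2}\,X
\]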
Three empirical facts (§5) collapse this system to a tractable form:
| Observation | Consequence | Simplification |
|---|---|---|
| Carbon kept in surplus by feeding \(F(t)\) | \(S\gg K_S\), so carbon term ≈ 1 | Drop substrate limitation |
| Aeration sets a rate ceiling | \(O/(K_O+O)\) becomes a fixed factor \(\phi_{\text{aer}}\) | Oxygen → aeration multiplier |
| Acid climbs then plateaus (logistic shape) | Acid-inhibition term dominates the curvature | Keep \((1-A/A_{\max})\) |
With biomass quasi-proportional to acid-producing capacity, the TA balance reduces to a logistic (Verhulst) law — the canonical model for a batch filling toward its acid ceiling, and the mechanistic backbone of the hybrid digital twin's acidification-kinetics term:
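In differential and closed form, using the same expression quoted in the project-mapping table at the end of this document:

\[
\frac{dA}{dt} = k\,A\left(1-\frac{A}{A_{\max}}\right)
\quad\Longrightarrow\quad
A(t)=\frac{A_{\max}}{1+e^{-k\,(t-t_0)}}
\]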
where \(A_{\max}\) is the TA carrying capacity (g/L), \(k\) the intrinsic acidification rate (day⁻¹, scaled by aeration factor \(\phi_{\text{aer}}\)), and \(t_0\) the inflection day. In the early phase \(A\ll A_{\max}\) the law reduces to \(dA/dt\approx kA\), an exponential rise. Through the middle of the run, around the inflection \(A\approx A_{\max}/2\), the slope sits close to its maximum \(kA_{\max}/4\) and is nearly constant, which explains the strong within-batch linear fits of §5.1.
These are the TA soft-sensor prototypes the project targets at R²≥0.9 (Milestone 5), trained on the legacy data. Two complementary predictors are built: a mechanistic per-batch logistic (best for trajectory shape, and the digital twin's acidification core) and a pooled multivariate regression (a transparent input→TA estimator). Both target output_total_acidity_g_per_l (TA); pH is a downstream cross-check (§9). On live, instrumented pilot data these become the neural-network soft sensors with the same target.
For each tote, fit \(\theta=(A_{\max},k,t_0)\) by minimising squared residuals on the closed-form logistic, with \(A_{\max}\) bounded to keep fits physical when a tote is still in its rising phase:
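The objective can be written as below. The cap factor \(c\) is not stated in the source; \(c = 2.5\) is assumed here because several fitted totes finish at exactly 40% of \(A_{\max}\).

\[
\hat\theta = \arg\min_{\theta}\sum_i \bigl(A_i - A(t_i;\theta)\bigr)^2,
\qquad 0 < A_{\max} \le c\,\max_i A_i
\]

A minimal, dependency-free sketch of such a fit follows. It is not the optimiser used for the reported numbers: it scans a grid over \(k\) and \(t_0\) and, because the model is linear in \(A_{\max}\), solves for \(A_{\max}\) in closed form at each grid point before clipping it to the cap. The grid ranges and step sizes are illustrative choices.

```python
import math

def fit_logistic(days, ta, cap_factor=2.5):
    """Least-squares fit of A(t) = a_max / (1 + exp(-k (t - t0))).

    Grid search over (k, t0); a_max is the closed-form least-squares
    scale for each pair, clipped to cap_factor * max(ta) (assumed bound).
    """
    cap = cap_factor * max(ta)
    best = None
    for i in range(5, 201):              # k from 0.010 to 0.400 per day
        k = 0.002 * i
        for j in range(0, 181):          # t0 from 0 to 90 days
            t0 = 0.5 * j
            g = [1.0 / (1.0 + math.exp(-k * (t - t0))) for t in days]
            a_max = sum(a * x for a, x in zip(ta, g)) / sum(x * x for x in g)
            a_max = min(a_max, cap)
            sse = sum((a - a_max * x) ** 2 for a, x in zip(ta, g))
            if best is None or sse < best[0]:
                best = (sse, a_max, k, t0)
    sse, a_max, k, t0 = best
    mean_a = sum(ta) / len(ta)
    sst = sum((a - mean_a) ** 2 for a in ta)
    return {"a_max": a_max, "k": k, "t0": t0, "r2": 1.0 - sse / sst}

# Synthetic, noise-free curve generated from the tote 23_053 parameters.
days = list(range(0, 28, 3))
ta = [39.6 / (1.0 + math.exp(-0.126 * (t - 14.0))) for t in days]
fit = fit_logistic(days, ta)
print(fit)
```

On real, noisy tote data a bounded nonlinear least-squares routine would normally replace the grid; the grid is used here only to keep the example self-contained.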
A transparent, deployable input→acidity map using the operator controls plus elapsed day:
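Written out with the term names used in the coefficient table, the map has the linear form:

\[
\widehat{\mathrm{TA}} = \beta_0 + \beta_1\,\text{day} + \beta_2\,\text{etoh} + \beta_3\,\text{temp} + \beta_4\,\text{aerord} + \beta_5\,\text{vol}
\]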
Fitted coefficients (ordinary least squares, all 297 complete points):
| Term | Coefficient | Interpretation |
|---|---|---|
| intercept \(\beta_0\) | 20.05 | baseline offset (g/L) |
| day \(\beta_1\) | +0.276 | +0.28 g/L per day — the dominant driver |
| etoh \(\beta_2\) | −2.157 | high residual ethanol ⇒ acid not yet formed |
| temp \(\beta_3\) | +0.183 | warmer ⇒ faster (Arrhenius-like) |
| aerord \(\beta_4\) | +0.587 | each aeration step adds ~0.6 g/L |
| vol \(\beta_5\) | −0.016 | dilution / larger headspace, minor |
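The fitted coefficients can be applied directly as a point estimator. The sketch below copies them from the table; the function name and the example operating point are hypothetical.

```python
# Coefficients copied from the fitted-coefficient table (OLS, 297 points).
COEF = {
    "intercept": 20.05,
    "day": 0.276,
    "etoh": -2.157,
    "temp": 0.183,
    "aerord": 0.587,
    "vol": -0.016,
}

def predict_ta(day, etoh, temp, aerord, vol):
    """Pooled linear estimate of titratable acidity in g/L."""
    return (COEF["intercept"]
            + COEF["day"] * day
            + COEF["etoh"] * etoh
            + COEF["temp"] * temp
            + COEF["aerord"] * aerord
            + COEF["vol"] * vol)

# Hypothetical operating point: day 20, 2.0 % EtOH, 30 °C, LOW (ordinal 2), 800 L.
ta_hat = predict_ta(day=20, etoh=2.0, temp=30.0, aerord=2, vol=800)
print(round(ta_hat, 2))  # 15.12
```

Because the model is linear and pooled across totes, the estimate should be read alongside the cross-validated error reported in the validation section rather than as a per-tote trajectory.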
A gradient-boosted tree ensemble (200 trees, depth 3, η = 0.05) on the same features provides a flexible benchmark and a feature-importance read:
| Feature | Importance |
|---|---|
| day | 0.614 |
| etoh | 0.149 |
| vol | 0.119 |
| temp | 0.064 |
| aerord | 0.054 |
Both estimators agree that elapsed day carries the most predictive signal, with ethanol the strongest control variable, which is consistent with the time-driven kinetic reduction.
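Mean fit quality across all 16 totes: R² = 0.904. The "% of Amax reached" column (final acidity ÷ fitted \(A_{\max}\)) ranges from 40% to 105%, so several totes are harvested while still climbing. Five totes sit at exactly 40%, which suggests their \(A_{\max}\) is pinned at the fit's upper bound (2.5× final acidity) rather than resolved by the data; their \(A_{\max}\) and \(t_0\) values should be read with lower confidence.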
| Tote | Amax (g/L) | k (day⁻¹) | t₀ (day) | R² | % of Amax reached |
|---|---|---|---|---|---|
| 23_013 | 20.4 | 0.237 | 6.7 | 0.992 | 103 |
| 23_053 | 39.6 | 0.126 | 14.0 | 0.991 | 85 |
| 23_986 | 28.2 | 0.065 | 38.0 | 0.982 | 69 |
| 23_978 | 102.0 | 0.070 | 50.9 | 0.977 | 40 |
| 23_998 | 39.9 | 0.070 | 38.5 | 0.971 | 71 |
| 23_995 | 82.2 | 0.078 | 30.4 | 0.966 | 40 |
| 23_036 | 69.5 | 0.032 | 69.3 | 0.959 | 40 |
| 23_980 | 32.2 | 0.117 | 15.2 | 0.956 | 86 |
| 23_982 | 78.2 | 0.059 | 41.9 | 0.955 | 40 |
| 23_037 | 36.0 | 0.104 | 16.7 | 0.914 | 82 |
| 23_014 | 36.3 | 0.056 | 33.4 | 0.898 | 61 |
| 23_055 | 52.8 | 0.035 | 41.9 | 0.884 | 52 |
| 23_991 | 28.5 | 0.053 | 14.1 | 0.846 | 84 |
| 23_045 | 30.8 | 0.048 | 20.3 | 0.815 | 105 |
| 23_032 | 21.8 | 0.145 | 14.9 | 0.755 | 103 |
| 0_045 | 63.0 | 0.028 | 79.7 | 0.611 | 40 |
Example fit (tote 23_053): \(A_{\max}=39.6\) g/L, \(k=0.126\) day⁻¹, \(t_0=14.0\) day, R² = 0.991. The S-curve captures lag, exponential rise, and onset of plateau.
The acid-prediction models are validated with leave-one-tote-out (LOTO) cross-validation: each tote is predicted by a model trained on the other fifteen. This is the realistic test of generalising to a new vessel.
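The LOTO loop can be sketched as below, using the simplest estimator (TA regressed on day alone) as a stand-in for any of the models. How the reported metrics were pooled is not stated in the source; this sketch assumes all held-out predictions are pooled before R², RMSE and MAE are computed, and the data are synthetic.

```python
import math

def fit_line(rows):
    """OLS of TA on day for rows of (tote, day, ta); returns (slope, intercept)."""
    n = len(rows)
    mean_d = sum(r[1] for r in rows) / n
    mean_a = sum(r[2] for r in rows) / n
    sxx = sum((r[1] - mean_d) ** 2 for r in rows)
    sxy = sum((r[1] - mean_d) * (r[2] - mean_a) for r in rows)
    slope = sxy / sxx
    return slope, mean_a - slope * mean_d

def loto_cv(rows):
    """Leave-one-tote-out CV; metrics pooled over all held-out points."""
    y_true, y_pred = [], []
    for held_out in sorted({r[0] for r in rows}):
        train = [r for r in rows if r[0] != held_out]
        slope, intercept = fit_line(train)   # model never sees held_out
        for tote, day, ta in rows:
            if tote == held_out:
                y_true.append(ta)
                y_pred.append(intercept + slope * day)
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
    }

# Synthetic totes with different acidification rates (not workbook data).
rates = {"A": 0.3, "B": 0.5, "C": 0.8}
rows = [(tote, day, 1.0 + rate * day)
        for tote, rate in rates.items() for day in (0, 10, 20, 30)]
scores = loto_cv(rows)
print(scores)
```

Splitting by tote rather than by row matters here: rows from the same tote are strongly autocorrelated, so a random row split would leak trajectory information into the training set.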
| Model | In-sample R² | LOTO R² | LOTO RMSE (g/L) | LOTO MAE (g/L) |
|---|---|---|---|---|
| Per-tote logistic (mechanistic) | 0.904 | —† | — | — |
| Multivariate linear | 0.595 | 0.450 | 5.53 | 4.19 |
| Gradient-boosted trees | — | 0.349 | 6.02 | 4.75 |
| Zero-order time model \(A=8.22+0.32\,\text{day}\) | 0.455 | — | 5.50 | 4.32 |
†The logistic is fit per-tote and characterises a known vessel's trajectory; it is not a blind cross-tote predictor. For predicting a brand-new tote from inputs, the linear model leads (LOTO R² 0.45, RMSE 5.5 g/L).
pH is the dissociation read-out of the organic acids produced. For weak organic acids (acetic/lactic/tartaric, pKₐ ≈ 3–4.8) the Henderson–Hasselbalch relation predicts a logarithmic dependence on acid concentration:
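The Henderson–Hasselbalch relation is \(\mathrm{pH} = \mathrm{p}K_a + \log_{10}([\mathrm{A^-}]/[\mathrm{HA}])\). For a single monoprotic weak acid at molar concentration \(C_a\) with no buffering, setting \([\mathrm{H^+}]\approx[\mathrm{A^-}]\) and \([\mathrm{HA}]\approx C_a\) gives the textbook log form below. Treating TA as a proxy for \(C_a\), with free intercept \(a\) and slope \(b\), is an assumption of this sketch rather than a result stated in the source.

\[
\mathrm{pH} \approx \tfrac12\,\mathrm{p}K_a - \tfrac12\log_{10} C_a
\quad\Longrightarrow\quad
\mathrm{pH} \approx a - b\,\log_{10}A
\]

In this idealisation the slope is \(b = 0.5\); buffering by the pomace matrix would be expected to flatten it.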
Fitting both the log form and a linear form to the data:
| pH model | Equation | R² | RMSE | MAE |
|---|---|---|---|---|
| Log (Henderson–Hasselbalch) | pH = 4.523 − 0.443·log₁₀(A) | 0.198 | 0.249 | 0.201 |
| Linear | pH = 4.340 − 0.0201·A | 0.291 | 0.234 | 0.192 |
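As a quick cross-check, both fitted pH models can be evaluated at the TA completion target. The sketch below copies the two equations from the table; the function names are illustrative.

```python
import math

def ph_log(ta):
    """Log (Henderson-Hasselbalch style) model from the table."""
    return 4.523 - 0.443 * math.log10(ta)

def ph_linear(ta):
    """Linear model from the table."""
    return 4.340 - 0.0201 * ta

target_ta = 30.0  # g/L, the TA >= 3% completion spec
ph_at_target = (ph_log(target_ta), ph_linear(target_ta))
print(ph_at_target)
```

At 30 g/L the two models give about 3.87 and 3.74, both inside the pH ≤ 4.4 HACCP envelope, although with R² below 0.3 neither is tight enough to infer TA from pH.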
End-to-end, the deployable predictor chains cleaning → feature build → dual estimator → pH sub-model → horizon forecast.
The endpoint-day inversion is the model's most useful output: given the TA target \(A^\star\) (the HACCP completion spec, TA ≥ 3% ≈ 30 g/L), solve the logistic for the day the batch reaches it — turning a 45-day wait into a forecast:
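Solving the closed-form logistic \(A(t)=A_{\max}/(1+e^{-k(t-t_0)})\) for the day at which \(A=A^\star\) gives the inversion below, which matches the expression quoted in the project-mapping table. It is defined only when \(A^\star < A_{\max}\); a tote whose fitted ceiling sits below the target never reaches it under this model.

\[
t^\star = t_0 + \frac{1}{k}\,\ln\frac{A^\star}{A_{\max}-A^\star}
\]

A minimal Python sketch of the same calculation, using fitted parameters from the per-tote table; the function name and the `None` return for unreachable targets are illustrative choices, not part of the source.

```python
import math

def endpoint_day(a_max, k, t0, target=30.0):
    """Day at which the fitted logistic reaches `target` g/L, or None."""
    if not (0.0 < target < a_max) or k <= 0.0:
        return None  # target at or above the fitted ceiling: never reached
    return t0 + math.log(target / (a_max - target)) / k

eta_053 = endpoint_day(39.6, 0.126, 14.0)  # tote 23_053
eta_013 = endpoint_day(20.4, 0.237, 6.7)   # tote 23_013, ceiling below 30 g/L
print(eta_053, eta_013)
```

For tote 23_053 this gives roughly day 23, ahead of its logged 27-day run; tote 23_013 returns `None` because its fitted \(A_{\max}\) of 20.4 g/L is below the 30 g/L target.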
Translating the fitted model into operating guidance — what each input does and how strongly the data support it.
| Lever | Effect on acidification | Evidence strength | Operating note |
|---|---|---|---|
| Aeration ↑ | Raises rate ceiling; MED ≈ +35% mean TA | Strong (β₄>0, stratified means), though partly confounded with time (§5) | O₂-transfer limited, so it is the MPC's first lever; step to MED for the finishing phase |
| Carbon / substrate feed | Sustains carbon so TA keeps climbing | Strong (fed-system signature) | Keep substrate in surplus; don't let the culture starve |
| Temperature | +0.064 g/L·day per °C, optimum ~28–31 °C | Moderate (r=0.39, p=0.13) | Stay mesophilic; >35 °C risks culture stress |
| Volume | Mild dilution / headspace effect | Weak (β₅ small) | Secondary; affects O₂ surface ratio |
| Time | Dominant — logistic accumulation | Very strong (r=0.68) | Use \(t^\star\) inversion to forecast the endpoint day |
This analysis is the historical-data leg of the project. Each result below feeds a specific proposal component and, in several cases, independently corroborates a proposal claim using CDI's own numbers.
| This document | Ai26.10 component | What it establishes |
|---|---|---|
| Logistic TA law \(A(t)=A_{\max}/(1+e^{-k(t-t_0)})\), mean fit R²=0.90 | Hybrid digital twin — acidification-kinetics mechanistic core | The mechanistic backbone the NN residual-learning layer corrects against |
| Linear / GBM TA estimators from inputs | NN soft sensor (Milestone 5, target R²≥0.9) | The within-batch logistic fit already meets R²≥0.9 on legacy low-frequency data, a strong feasibility signal; the cross-tote linear / GBM estimators reach LOTO R² 0.45 / 0.35 |
| Oxygen-transfer-limited finding; MED → +35% TA | MPC optimising aeration/mixing/temperature | Confirms aeration is the highest-value control lever, grounding the \(k_La\)/OUR twin terms |
| Endpoint-day inversion \(t^\star=t_0+\tfrac1k\ln\frac{A^\star}{A_{\max}-A^\star}\) | ETA-to-target dashboard; cycle-time KPI | The mechanism behind the 45 → 10–15 day claim, expressed per-batch |
| Per-batch rate spread (k ≈ 0.03 → 0.24/day) | RSM optimal-window targeting | Quantifies the gap between slow and fast batches that closed-loop control closes |
| Cold-start LOTO R²≈0.45 vs within-batch ≈0.90; feed events unlogged | Pilot Phase 2 / Milestone 3 rationale | Independently justifies why high-frequency instrumented pilot trials are necessary |
| pH weakly predictive (R²≈0.29), buffered | Sensor-fusion soft sensor design | Shows why TA can't be read off a cheap pH probe — the soft sensor earns its keep |
| Autoencoder-ready residual structure (off-trend points) | Anomaly detection (Milestone 5) | Trend model provides the baseline against which drift/contamination is flagged |