Estimation and Solvers
If you have ever felt lost choosing between FOCE, LAPLACE, IMP, or SAEM, this page is for you. We will walk through the available estimation methods, when to pick each one, and the little tweaks that make runs faster and more robust.
Estimation methods are the algorithms used to identify the best-fit model population parameters (THETAs, OMEGAs, and SIGMAs) given the specified model and the analysis dataset. The model, together with the dataset, is what we call a “problem” to be solved by our estimation methods.
The aim of this page is to:
- Give a brief overview of the available estimation methods and sensible tweaks/options
- Give an intuitive explanation of the estimation methods on a conceptual level
- Give a sense of when to choose which estimation method
In essence, there are two families of estimation methods available in NONMEM:
- Maximum-Likelihood (“Classical” or “MLE”1) methods
- Expectation-Maximization (“EM”) methods
Finding the individual ETAs is called “Maximum a Posteriori (MAP) estimation”2.
The optimal ETA values themselves are called either Empirical Bayes Estimates (EBEs), Conditional Parametric ETAs (CPEs), MAP estimates, or conditional mode estimates, symbolically denoted η̂.
NONMEM performs MAP estimation using the BFGS (Broyden-Fletcher-Goldfarb-Shanno) optimizer by default. This can be changed using the `OPTMAP` option, but it should generally be left at its default setting.
Subroutines
Before diving into the estimation methods, let's quickly talk about `$SUBROUTINES`.
NONMEM comes with a library of ADVAN routines (Table 1). Their function is to "advance" the kinetic system from one time point to the next (hence "ADVAN"). Another possible name for ADVAN would have been SOLVER, because most ADVAN routines solve a set of differential equations, either analytically or by integration. We differentiate the solvers by their computational nature, as this will impact how we choose and optimize our estimation algorithm. In general, "numerical integration problems" are problems using `$DES`, and "analytical problems" are those that do not (see column "Solver" in Table 1).
Table 1: ADVAN routines

| Routine | Solver | Control records | Description |
|---|---|---|---|
| ADVAN1 | Analytic | | One Compartment Linear Model |
| ADVAN2 | Analytic | | One Compartment Linear Model with First Order Absorption |
| ADVAN3 | Analytic | | Two Compartment Linear Model |
| ADVAN4 | Analytic | | Two Compartment Linear Model with First Order Absorption |
| ADVAN5 | Analytic | `$MODEL` | General Linear Model |
| ADVAN6 | Numeric | `$MODEL`, `$DES` | General Nonlinear Model |
| ADVAN7 | Analytic | `$MODEL` | General Linear Model with Real Eigenvalues |
| ADVAN8 | Numeric | `$MODEL`, `$DES` | General Nonlinear Model with Stiff Differential Equations |
| ADVAN9 | Both | `$MODEL`, `$DES`, `$AES`, `$AESINITIAL` | General Nonlinear Model with Equilibrium Compartments |
| ADVAN10 | Analytic | | One Compartment Model with Michaelis-Menten Elimination |
| ADVAN11 | Analytic | | Three Compartment Linear Model |
| ADVAN12 | Analytic | | Three Compartment Linear Model with First Order Absorption |
| ADVAN13 | Numeric | `$MODEL`, `$DES` | General Nonlinear Model using LSODA |
| ADVAN14 | Numeric | `$MODEL`, `$DES` | General Nonlinear Model using CVODES |
| ADVAN15 | Both | `$MODEL`, `$DES` | General Nonlinear Model with Equilibrium Compartments using IDA |
| ADVAN16 | Numeric | `$MODEL`, `$DES` | Delay Differential Equation Model using RADAR5 |
| ADVAN17 | Both | `$MODEL`, `$DES`, `$AES`, `$AESINITIAL` | Delay Differential Equation Model with Equilibrium Compartments using RADAR5 |
| ADVAN18 | Numeric | `$MODEL`, `$DES` | Delay Differential Equation Model using DDE_SOLVER |
For each ADVAN, there exists a group of TRANS subroutines (numbered 1–6). Think of TRANS as a "translator" (hence "TRANS") that swaps one parameterization for another so you can work in the language that feels most natural to you.

- `TRANS1` is the default for all ADVANs and means that you are using rate constants (k's) as your main parameters.
- `TRANS2`, `TRANS3`, and `TRANS4` are sometimes seen with analytic solvers, meaning you are using CL, V, and potentially Q and KA as parameters instead of rate constants.
- `TRANS5` and `TRANS6` are rarely used, so they can be disregarded for all practical purposes.

For every run, you specify one ADVAN routine and one TRANS routine in the `$SUBROUTINES` record. If you are using `TRANS1`, you don't need to specify it, since it's the default.
Fine-tuning tolerance
Tolerance options may be set at `$SUBROUTINES`, which will be in effect throughout the entire problem3. These options are used for the numerical `ADVAN` routines (see column "Solver" in Table 1).

- `TOL=n`: The relative significant digits precision to which differential equations are to be integrated, so the precision is 10^-TOL. Used with `ADVAN6`, `ADVAN8`, `ADVAN9`, and `ADVAN13` to specify the number of digits that are required to be accurate in the computation of the drug amount in each compartment; the precise meaning depends on the particular ADVAN routine that uses it [1].
- `ATOL=n`: The absolute significant digits precision to which differential equations are to be integrated, so the accuracy is 10^-ATOL. The default is 12. Used with `ADVAN9`, `ADVAN13`, `ADVAN14`, `ADVAN15`, `ADVAN16`, `ADVAN17`, and `ADVAN18`. On occasion, you may want to reduce `ATOL` (usually set equal to `TOL`) to improve speed.
- For the numerical ADVAN routines, internal precision is based on `TOL`, as opposed to the analytical routines, where internal precision is based on `SIGL`.
Classical MLE methods
MLE (or “classical”) methods want to be “statistically fair” to all possibilities. These are also the types of methods used in classical statistical tests like t-test, AN(C)OVA, and even the simple mean of a set of values.
There are essentially three MLE methods available in NONMEM:
- FO (First Order)
- FOCE (First Order with Conditional Estimation)
- LAPLACE (Second order)
The "order" refers to the derivative evaluated in the Taylor expansion [2]. FO and FOCE therefore evaluate the first-order derivative (Jacobian matrix), or the "slope" of the objective function value (OFV). LAPLACE evaluates the second-order derivative (Hessian matrix), or the "curvature" of the OFV, requiring more computational effort than the first-order methods.
FO stands for First Order. It is the perfect method when no ETAs are computed4, as in, e.g., simple TTE models that do not contain a frailty component. It may also be helpful as an exploratory method when fast, approximate results are desirable.
To get the ETAs, specify the option `POSTHOC` (Latin for "after this").
FO can be specified as either `METHOD=ZERO` or `METHOD=0`. Either one is fine, although `METHOD=0` is more common.
FOCE stands for First Order Conditional Estimation. FOCE simultaneously estimates both fixed and random effects, unlike the FO method, which estimates fixed effects first, then random effects.
There is no need to specify the `POSTHOC` option with FOCE; it only applies to the FO method.

It is very common to also use the `INTERACTION` option, and that is then called the FOCEI (FOCE with Interaction) method. The `INTERACTION` option allows for interaction between the two types of random effects: ETAs (between-subject variability) and EPSILONs (residual variability).
FOCE can be specified as either `METHOD=CONDITIONAL` or `METHOD=1`. For explicitness, `METHOD=CONDITIONAL` is preferred.
Options
Sensible options to add from the start:

```
$ESTIMATION METHOD=CONDITIONAL
            INTERACTION     ; (1)
            MAXEVAL=99999
            FAST            ; (2)
            CTYPE=4
            NONINFETA=1     ; (3)
            ETASTYPE=1      ; (3)
            NSIG=2          ; (4)
            SIGL            ; (5)
            SIGLO           ; (6)
            MSFO=root.msf
            PRINT=1
```
1. `INTERACTION` should always be used when residual error is proportional, and does not impact the analysis when the residual error is additive. Thus, always use it.
2. Always use the `FAST` option; there are very few cases where `FAST` is suboptimal5. The `FAST` option benefits from MU-modeling to be more effective.
3. `NONINFETA=1` and `ETASTYPE=1` should be added when not all ETAs are used for all subjects, for example, an ETA on KA during a fit of a subject with only IV dosing.
4. `NSIG <= SIGL/3`. The default is `NSIG=3`, which is OK for analytical problems, but for numerical integration problems `NSIG=2` is the largest one should insist on6.
5. `SIGL <= TOL` (for numerical integration problems). `SIGL <= 14` (for analytical problems, about the limit of double precision).
6. `SIGLO <= TOL` and `SIGLO >= SIGL`. By default, `SIGLO=SIGL`.
Troubleshooting options to add as needed:

```
NSIG=2             ; (1)
NOABORT            ; (2)
NOHABORT           ; (3)
NOTHETABOUNDTEST   ; (4)
NOOMEGABOUNDTEST   ; (4)
NOSIGMABOUNDTEST   ; (4)
MCETA=1000         ; (5)
SEED=14455         ; (6)
RANMETHOD=4P       ; (6)
SADDLE_RESET=1
REPEAT
```
1. Some trial and error for `NSIG`, `SIGL`, and `SIGLO` may still be required given the data.
2. If "Hessian not positive definite".
3. If there is trouble even at the beginning of an optimization with `NOABORT`.
4. If "PARAMETER ESTIMATE IS NEAR ITS DEFAULT BOUNDARY.".
5. If issues with MAP (ETA) estimation.
6. If using `MCETA`.
The relevant default options for FOCE are:
$ESTIMATION METHOD=CONDITIONAL
ABORT
ATOL=12
BOOTDATA=0
CONDITIONAL
CTYPE=0
ETADER=0
ETASTYPE=0
FILE=root.ext
FNLETA=1
FORMAT=s1PE12.5
LEVWT=0 ; If $LEVEL is used
MASSRESET=-1
MAXEVAL=224
MCETA=0
NOLABEL=0
NONINFETA=0
NOPRIOR=0
NOSLOW
NOSUB=0
NOTITLE=0
NSIG=3 ; Also "SIGDIGITS" or "NSIGDIGITS"
NUMDER=0
NUTS_EPARAM
NUTS_MASS
NUTS_OPARAM
NUTS_REG
NUTS_SPARAM
OMITTED
ORDER
PARAFILE
PARAFPRINT
PREDICTION
PRINT=9999
PRIORC
SADDLE_HESS=0
SADDLE_RESET=0
NOSORT
NOREPEAT
NONUMERICAL
NOCENTERING
NOINTERACTION
NOETABARCHECK
THETABOUNDTEST
OMEGABOUNDTEST
SIGMABOUNDTEST
Similar to FOCE, but evaluates the second-order derivative (Hessian) instead of the first-order (Jacobian). It should really be called SOCE (second order conditional estimation). LAPLACE is specified as either `LAPLACE` (most common) or `LAPLACIAN`, as an option to FOCE.
Code for numerical stability
Instead of parameter bounds, re-parameterize. For example, if a parameter `A` should have a lower bound of 0, code it in logarithmic space: `A = EXP(THETA)`. This way, A has a lower bound of 0, but the associated THETA is unbounded.
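A sketch contrasting the two approaches (THETA numbering and values are illustrative):

```
; Bounded parameterization: the optimizer can hit the boundary
$THETA (0, 5)        ; A

; Unbounded log-space parameterization
$THETA 1.6           ; LOG(A), i.e., LOG(5)
$PK
A = EXP(THETA(1))    ; A > 0 by construction, THETA(1) unbounded
```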
To guard against floating-point exceptions, simply write `$ABBREVIATED PROTECT` at the beginning of your control stream. NONMEM will then use "protected" versions of `EXP()`, `LOG()`, `SQRT()`, etc.
Small number of subjects (< 10)?
If you are analyzing very few subjects (fewer than 10), it is most accurate to correct the degrees of freedom when assessing OMEGAs. This can be done using `LDF`, which stands for "loss of degrees of freedom".

If you can MU-model all THETAs, then set `LDF=-1` in `$PK` to let NONMEM do the correction automatically.

If you cannot MU-model all THETAs, the most accurate approach is to set `LDF(n)` to the number of to-be-estimated THETAs involved in `MU_n`.
Precision in practice
`NSIG`: The significant digits for THETAs, OMEGAs, and SIGMAs. Estimation stops when each changes by less than `NSIG` digits between iterations.

`SIGL`: The significant digits for the individual OFV (iOFV). The internal NONMEM `SIGL` is set to 10. Rule of thumb: the iOFV is evaluated to a precision of the smaller of `TOL` and the internal `SIGL`. If all iOFVs are of the same sign, then the total OFV could have a resulting precision of `SIGL` + log10(N), where N is the number of subjects, up to a maximum of 15, the limiting precision of double precision. Thus, with 100 subjects, the actual precision to which the total OFV is evaluated could be 12. One should not necessarily rely on this, so it is safest to assume the more conservative precision of 10, for which a suitable `NSIG` would be 3.
For numerical integration problems:

- Internal precision is based on the user-specified `TOL`, not `SIGL` (see Section 1)
- Often `TOL=6` or less, not 10
- The largest `NSIG` one should insist on is 2
- `SIGL` ≤ `TOL`
- `NSIG` ≤ `SIGL` / 3
`SIGLO`: The significant digits for ETAs. By default, `SIGLO=SIGL`.
Rule-of-thumb for setting NSIG, SIGL, and SIGLO
Most problems will benefit from some trial-and-error testing. See Table 1 for info on solvers.
- Analytical solvers:
  - `SIGL` ≤ 14 (about the limit of double precision)
  - `NSIG` ≤ `SIGL` / 3
- Numerical solvers:
  - `SIGL` ≤ `TOL`
  - `NSIG` ≤ `SIGL` / 3
  - So, if `TOL=6`, then `SIGL` should be 6, and `NSIG` should be 2
- `SIGLO` ≤ `TOL`
CTYPE
Sometimes you may see several iterations with almost no change in OFV. What's happening? One or more parameters are probably bouncing around in the second significant digit, with `NSIG=3`. If a parameter's influence on the objective is this tiny, the traditional convergence rule "every parameter must change by less than `NSIG` significant digits" may never be met, even though the objective is already at its minimum.

To handle this, add `CTYPE=4`. If the OFV doesn't change beyond `NSIG` decimal places for 10 straight iterations, the minimization is deemed successful and the run terminates.
SADDLE_RESET
The `SADDLE_RESET` option was developed to enable more robust parameter estimation through a check for local practical identifiability [3]. In the original article, the authors recommend setting `SADDLE_RESET=1` when using MLE methods (FO, FOCE, LAPLACE).
MCETA
During ETA optimization, NONMEM must pick a starting set of individual ETAs. By default it uses all zeros, but that may trap the algorithm in a local minimum. The `MCETA` option tells NONMEM to test alternative starting points and keep the one that gives the lowest objective function:

- `MCETA=0` (default): Only ETA = 0 is tried.
- `MCETA=1`: Compares ETA = 0 with the ETAs from the previous iteration and starts MAP with whichever produces the lower OFV.
- `MCETA>1`: Generates `MCETA - 1` random ETA vectors sampled from N(0, Ω) and tests them together with ETA = 0 and the previous ETAs; the vector with the lowest OFV becomes the MAP starting point.
If the algorithm ends up in local minima despite `MCETA`, try increasing the value (e.g., from 100 to 1000). Since `MCETA` is most impactful during the early iterations7, another option, to still benefit from the robustness of `MCETA` without impacting the estimation speed too much, is to only use `MCETA` for the first 200 iterations or so.
As mentioned above, MLE methods are inherently reproducible since they provide only one correct answer. However, if one uses the `MCETA` option, a stochastic Monte-Carlo element is introduced that benefits from additional options for reproducibility:

- `RANMETHOD=4P`: The `4` sets the randomization method to the same as the default one in `$SIMULATION`, and the `P` is necessary when running NONMEM in parallel.
- `SEED=123456789`: This enables the Monte-Carlo sequence to be reproduced. Choose any integer, but be consistent within your analysis.
Using the `FAST` option together with MU-modeling will most often improve estimation speeds. Reducing the required significant digits might also increase estimation speeds.

Additionally, the classical NONMEM methods can be facilitated by an EM method: first run a rapid EM method, such as iterative two-stage, and pass the resulting parameters on to FOCE to speed up the analysis.
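Such an ITS-then-FOCE chain could be sketched as (option values are illustrative):

```
$ESTIMATION METHOD=ITS INTERACTION NITER=50 PRINT=1       ; rapid EM pre-fit
$ESTIMATION METHOD=CONDITIONAL INTERACTION MAXEVAL=99999  ; final FOCE fit
```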
EM-methods
EM stands for Expectation-Maximization, and was first published in 1977. EM derives its name from the two iterative steps that define its approach:
E (Expectation) step: THETAs, OMEGAs, and SIGMAs are fixed. ETAs and their variances are calculated. Sometimes, gradients of the likelihood for THETAs and SIGMAs are also calculated.
M (Maximization) step: THETAs, OMEGAs, and SIGMAs are updated based on the ETAs and/or likelihood gradients. The M-step typically requires very little computation time.
Multiple parameter combinations could potentially explain the observed data. Rather than seeking a single “best” solution like MLE methods, EM explores the distribution of likely parameter values and identifies the most likely combination from this distribution.
It is important to understand that parameters can be estimated/refined ad infinitum, since they are continuous. Thus, a stopping (convergence) criterion is needed. Each time you run your model, you might get a slightly different result depending on the stopping criteria.

Also, to avoid searching the infinite space of all parameters (which would literally take forever), the different EM methods restrict the search area in different ways. In general, this also determines the speed of the algorithm. For example, IMP is generally faster than SAEM because IMP searches a smaller space around the initial estimates than SAEM does.
EM methods are most efficient when the model follows a specific statistical structure: fixed effects parameters (THETAs) only define the mean (MU) of the normal population distribution of individual parameters (phi = THETA + ETA). This requires a special coding technique called MU modeling for optimal performance.
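The MU-modeling structure can be sketched like this (log-normal clearance shown; names are illustrative):

```
$PK
MU_1 = THETA(1)            ; THETA only defines the mean of PHI(1)
CL   = EXP(MU_1 + ETA(1))  ; PHI(1) = MU_1 + ETA(1)
```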
There are essentially three EM methods available in NONMEM:
- ITS (Iterative Two-Stage)
- IMP (Importance sampling, of which IMPMAP is a special case)
- SAEM (Stochastic Approximation EM)
ITS stands for Iterative Two Stage. It is considered deterministic, as opposed to the other EM methods.

ITS is specified by:
```
$ESTIMATION METHOD=ITS INTERACTION
            NITER=50   ; (1)
```

1. `NITER` (default 50) sets the maximum number of iterations.
Normally, it is used as a rapid exploratory pre-analysis method to facilitate parameter estimation with IMP or SAEM.
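As a sketch, ITS as a pre-analysis step feeding its results to IMP could look like this (values are illustrative):

```
$ESTIMATION METHOD=ITS INTERACTION NITER=50              ; fast, deterministic pre-fit
$ESTIMATION METHOD=IMP INTERACTION NITER=200 ISAMPLE=300
```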
NONMEM has a direct Monte Carlo sampling method, specified with `METHOD=DIRECT`. This method is the purest method for performing E-M, in that it creates completely independent samples, and there is no chance of causing bias if the sampling density is not similar enough to the conditional density (unlike IMP). However, it is very inefficient, requiring `ISAMPLE` values of 10000–300000 to properly estimate the problem.
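A minimal sketch of a direct-sampling record (values are illustrative):

```
$ESTIMATION METHOD=DIRECT INTERACTION ISAMPLE=10000 NITER=10
```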
IMP stands for IMPortance sampling. The general IMP algorithm goes like this:

1. A MAP estimation (if `MAPITER=1`) is performed to estimate the parameters of the proposal density (aka "sampling distribution", q(x)). This can be replaced by Monte Carlo direct sampling, `METHOD=DIRECT` (see Section 3.0.2.2).
2. E-step: For every individual, ETAs are sampled `ISAMPLE` times from the proposal density. This gives the mean and variance of the ETAs that are most likely given the data.
3. M-step: Population parameters are updated.

The algorithm goes through this E-M loop (steps 2–3) `NITER` times, or until the convergence criterion is met.
IMP options:
```
$ESTIMATION METHOD=IMP INTERACTION
            NITER=1000                 ; (1)
            ISAMPLE=300                ; (2)
            ISAMPEND=10000 STDOBJ=10   ; (3)
            MAPITER=1                  ; (4)
            MAPINTER=-1                ; (5)
            IACCEPT=0.0 IACCEPTL=0.01  ; (6)
            DF=0                       ; (7)
            ISCALE_MIN=0.1 ISCALE_MAX=10              ; (8)
            CTYPE=1 CINTERVAL=10 CITER=10 CALPHA=0.05 ; (9)
```
1. `NITER`: Integer. Sets the maximum number of iterations (default 50). Usually about 50–200 iterations are required to approach the maximum likelihood population parameters [4]. Synonymous with `NSAMPLE`.
2. `ISAMPLE`: Integer. Number of random samples per subject used for the E-step (default 300). Usually 300 is sufficient, but 1000–3000 may be required for very sparse data, or when you want an OFV with low Monte Carlo noise.
3. `ISAMPEND` and `STDOBJ`: If `ISAMPEND` is greater than `ISAMPLE`, and `STDOBJ` > 0.0, then samples per subject will adjust between `ISAMPLE`–`ISAMPEND`, until the stochastic standard deviation of the OFV falls below `STDOBJ`.
4. `MAPITER=n`: Integer. For the first n iterations, MAP is used to assess parameters for the proposal density. After this, the Monte Carlo conditional means and variances of the previous iteration are used for the proposal density parameters of the present iteration.
5. `MAPINTER=n`: Integer. Every nth iteration, MAP is used to provide parameters to the proposal density. If `MAPINTER=-1`, then MAP will be turned on only if the OFV increases consistently over several iterations.
6. `IACCEPT` and `IACCEPTL`: Real. Lower `IACCEPT` to expand the proposal density (default 0.4). For very sparse data or highly nonlinear posterior densities (such as with categorical data), you may want to decrease `IACCEPT` to 0.1–0.3. `IACCEPTL` scales a second multivariate normal distribution, to cover long tails in the posterior density (hence "L" for "long tails"). If `IACCEPT=0.0`, the `DF` and `IACCEPT` level will be automatically determined for each subject (output in `root.imp`). If `IACCEPT=0.0` and `IACCEPTL` > 0, then a search for the best `IACCEPTL` for each subject is made, starting the testing at the `IACCEPTL` value given by the user, while `IACCEPT` is fixed to 1.
7. `DF`: Integer. Degrees of freedom for the proposal density. The default `DF=0` is a normal distribution; otherwise it is the d.f. for a t-distribution. For very sparse data or highly non-linear posterior densities (such as with categorical data), you may want to set `DF` to somewhere between 2–10.
8. `ISCALE_MIN` and `ISCALE_MAX`: Real. Boundaries for the scale factor used to vary the size of the variance of the proposal density in order to meet the `IACCEPT` condition. On very rare occasions, the OFV varies widely, and these options may need to be narrowed (e.g., `ISCALE_MIN=0.3 ISCALE_MAX=3`). Remember to revert these parameters to default operation on the next `$EST` step: `ISCALE_MIN=-100 ISCALE_MAX=-100`. The default is `ISCALE_MIN=0.1 ISCALE_MAX=10` for importance sampling.
9. See Section 3.0.2.4.
AUTO=1

```
$ESTIMATION METHOD=IMP AUTO=1
```

is equivalent to:

```
$ESTIMATION METHOD=IMP INTERACTION
            CTYPE=3 NITER=500 ISAMPLE=300 STDOBJ=10 ISAMPEND=10000
            NOPRIOR=1                        ; (1)
            CITER=10 CINTERVAL=1 CALPHA=0.05
            IACCEPT=0.0 ISCALE_MIN=0.1 ISCALE_MAX=10 DF=0
            MCETA=3                          ; (2)
            EONLY=0 MAPITER=1 MAPINTER=-1
```
1. `NOPRIOR` determines whether you want to use (`0`) prior information for this estimation record or not (`1`). Normally, if you specify a prior you want to use it, so set `NOPRIOR=0`.
2. `MCETA` works the same during MAP as it does for the MLE methods.
AUTO=2

`AUTO=2` is the same as `AUTO=1`, with these additional options:

1. `GRDQ`: The gradient quick option. Used for THETAs that are not MU-modeled, for SIGMAs, and for THETAs that model RUV.
2. `DERCONT`: Derivative continuity. `DERCONT=1` might be helpful in cases of extreme discontinuity, when varying certain THETAs by even just a small amount results in a large change in OFV.
Monte Carlo noise
Remember, EM algorithms don’t find a single set of parameters that are the best fit. They find a distribution of parameter sets that maximize the expectation. Often those parameter sets are very similar and only differ in the 6th or 7th decimal place. But sometimes in complex8 models with poorly estimated parameters they can differ by larger numbers.
The iterations for the initial `ITS` step seem, in this case, to be quite stable, with some artifacts:
$ESTIMATION METHOD=ITS
INTERACTION
LAPLACE
NITER=200
SIG=3
SIGL=6
NOHABORT
CTYPE=3
NUMERICAL
SLOW
$ESTIMATION METHOD=IMP
INTERACTION
LAPLACE
ISAMPLE=1000
NITER=1000
SIG=3
SIGL=6
NOHABORT
CTYPE=3
IACCEPT=0.4
MAPITER=0
RANMETHOD=3S2P
$COVARIANCE UNCONDITIONAL MATRIX=S TOL=12 SIGL=12 SLOW
...
iteration 175 OBJ= 4693.4674554341409
iteration 176 OBJ= 4694.2296104065535
iteration 177 OBJ= 4693.7753507970829
iteration 178 OBJ= 4693.9600270372885
iteration 179 OBJ= 4693.5732455834705
iteration 180 OBJ= 4693.6386423202493
iteration 181 OBJ= 4693.6215390721527
iteration 182 OBJ= 4693.6006496138452
iteration 183 OBJ= 4693.7877620448235
iteration 184 OBJ= 4694.1591757809929
iteration 185 OBJ= 4693.2614956897451
iteration 186 OBJ= 4693.5641640401127
iteration 187 OBJ= 4693.5575289919379
iteration 188 OBJ= 4495.6489907149398
iteration 189 OBJ= 4693.7711764252363
iteration 190 OBJ= 4693.6281175153035
iteration 191 OBJ= 4694.1171774559862
iteration 192 OBJ= 4693.7908707845536
iteration 193 OBJ= 4693.7709264605819
iteration 194 OBJ= 4495.9262902940209
iteration 195 OBJ= 4693.3321354894242
iteration 196 OBJ= 4694.3177205227348
iteration 197 OBJ= 4694.1301486616576
iteration 198 OBJ= 4694.2898587322170
iteration 199 OBJ= 4693.8304358341920
iteration 200 OBJ= 4691.6818293505230
#TERM:
OPTIMIZATION WAS NOT COMPLETED
Note the small "spikes" in the OFV (e.g., at iterations 188 and 194).
The IMP step can be less stable:
iteration 120 OBJ= 4314.8310660241377 eff.= 446. Smpl.= 1000. Fit.= 0.96389
iteration 121 OBJ= 4326.9079856676717 eff.= 448. Smpl.= 1000. Fit.= 0.96409
iteration 122 OBJ= 4164.6649529423103 eff.= 479. Smpl.= 1000. Fit.= 0.96392
iteration 123 OBJ= 4299.9887619753636 eff.= 432. Smpl.= 1000. Fit.= 0.96395
iteration 124 OBJ= 4303.9571213327054 eff.= 399. Smpl.= 1000. Fit.= 0.96349
iteration 125 OBJ= 4328.9835950930074 eff.= 417. Smpl.= 1000. Fit.= 0.96423
iteration 126 OBJ= 4304.3861595488252 eff.= 550. Smpl.= 1000. Fit.= 0.96392
iteration 127 OBJ= 4291.0862736663648 eff.= 422. Smpl.= 1000. Fit.= 0.96430
iteration 128 OBJ= 4326.2378678645500 eff.= 407. Smpl.= 1000. Fit.= 0.96409
iteration 129 OBJ= 4157.5352046539456 eff.= 406. Smpl.= 1000. Fit.= 0.96404
iteration 130 OBJ= 4332.6894073732456 eff.= 399. Smpl.= 1000. Fit.= 0.96399
iteration 131 OBJ= 4357.5343346793761 eff.= 493. Smpl.= 1000. Fit.= 0.96414
Convergence achieved
iteration 131 OBJ= 4336.1893012015007 eff.= 417. Smpl.= 1000. Fit.= 0.96369
#TERM:
OPTIMIZATION WAS COMPLETED
It is not unusual to see this large Monte Carlo noise in the OFV estimate (e.g., at iterations 122 and 129). It indicates that the number of samples (`ISAMPLE`) may not be high enough. Consider restricting `ISCALE_MIN` and `ISCALE_MAX` as well.
During estimation with IMP, the iteration output contains:

- `OBJ`: The objective function value
- `eff.`: Average effective number of samples
- `Smpl.`: Actual number of samples (usually `ISAMPLE`)
- `Fit.`: Average fitness

`Fit.` tells you how normal the posterior is. If `Fit.` is approximately 1, you might as well use FOCE. If `Fit.` is more like 0.6, then IMP is a good choice. It is the average absolute difference between the posterior density and the proposal density among the samples, and indicates to what extent the proposal density matches the posterior density.

The scaling of the variance of the proposal density is adjusted (limited by `ISCALE_MIN` and `ISCALE_MAX`) so that `eff.` divided by `Smpl.` is approximately equal to `IACCEPT`. If you find that this is not the case, adjust `IACCEPT` or the `ISCALE` boundaries.
`METHOD=IMPMAP` is equivalent to `METHOD=IMP MAPITER=1 MAPINTER=1`. It is thus like IMP, with the added assistance of MAP estimation at every E-step.
SAEM stands for Stochastic Approximation Expectation-Maximization. Like IMP, random samples are generated from normal proposal densities. However, instead of always being centered at the mean or mode of the posterior density, the proposal density is centered at the previous sample position. The general SAEM algorithm goes like this:

1. "Stochastic", "exploratory", or "burn-in" phase (iterations set by `NBURN`):
   - Sampling of the ETA proposal density
   - Population parameters are updated based on the averaging of ETAs from the previous step in this phase.
2. "Accumulation" or "smoothing" phase (iterations set by `NITER`):
   - Sampling of the ETA proposal density (as in the burn-in phase)
   - Population parameters are updated based on the averaging of ETAs from all previous steps in this phase.
This method is useful for highly non-normal posterior densities9.
During the exploratory phase, SAEM acts like a searchlight that eventually oscillates around the maximum likelihood. Thus, SAEM does not need as large a starting density ("umbrella") as IMP.
The size of the explored space depends on the standard deviations of the random effects and the residual error. Exploring a large space increases the chances of finding the global maximum. Thus, it is recommended to choose large initial values for the OMEGAs and the SIGMAs.
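A common pattern (sketched here with illustrative values) is SAEM with generous initial variability, followed by an IMP evaluation step (`EONLY=1`) to obtain an OFV suitable for model comparison:

```
$ESTIMATION METHOD=SAEM INTERACTION NBURN=2000 NITER=1000 PRINT=10
$ESTIMATION METHOD=IMP INTERACTION EONLY=1 NITER=5 ISAMPLE=1000 MAPITER=0
```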
SAEM options:

```
$ESTIMATION METHOD=SAEM AUTO=1
```

or, fully specified:

```
$ESTIMATION METHOD=SAEM INTERACTION
            NBURN=500                ; (1)
            NITER=1000               ; (2)
            MCETA=0                  ; (3)
            ISAMPLE=2                ; (4)
            ISAMPEND=10              ; (5)
            ISAMPLE_M1=2             ; (6)
            ISAMPLE_M1A=0            ; (7)
            ISAMPLE_M1B=2            ; (8)
            ISAMPLE_M2=2             ; (9)
            ISAMPLE_M3=2             ; (10)
            IACCEPT=0.4              ; (11)
            ISCALE_MIN=1.0E-06 ISCALE_MAX=1.0E+06  ; (12)
            NOCOV=0                  ; (13)
            DERCONT=0                ; (14)
            CONSTRAIN=1              ; (15)
```
1. `NBURN`: Maximum iterations for the exploratory phase. Should be high enough for convergence to occur. Default 1000.
2. `NITER`: Maximum iterations for the smoothing phase. Default 1000.
3. `MCETA`: If greater than `ISAMPLE_M1`, then for the first iteration of SAEM, `ISAMPLE_M1`, `ISAMPLE_M2`, and `ISAMPLE_M3` will be set to `MCETA`, to facilitate a robust initial search for reasonable parameter values.
4. `ISAMPLE`: Integer. Number of samples per individual. More samples means less stochastic noise. There should be at least 50 total samples per iteration. For datasets with more than 50 individuals, `ISAMPLE=1` is usually fine.
5. `ISAMPEND`: Integer. NONMEM will perform 200 iterations to determine the best `ISAMPLE` value, with `ISAMPEND` as an upper bound.
6. `ISAMPLE_M1`: Number of "mode 1" iterations per `ISAMPLE`. Should usually be left at the default (2).
7. `ISAMPLE_M1A`: Number of "mode 1A" iterations per `ISAMPLE`. Should usually be left at the default (0).
8. `ISAMPLE_M1B`: Number of "mode 1B" iterations per `ISAMPLE`. Should usually be left at the default (2).
9. `ISAMPLE_M2`: Number of "mode 2" iterations per `ISAMPLE`. Should usually be left at the default (2).
10. `ISAMPLE_M3`: Number of "mode 3" iterations per `ISAMPLE`. Should usually be left at the default (2).
11. `IACCEPT`: Goal fraction of samples to be accepted. Should usually be left at the default (0.4).
12. `ISCALE_MIN` and `ISCALE_MAX`: Boundaries for the scale factor used to vary the size of the variance of the proposal density in order to meet the `IACCEPT` condition. On very rare occasions, the OFV varies widely, and these options may need to be narrowed (e.g., `ISCALE_MIN=0.001 ISCALE_MAX=1000`). Remember to revert these parameters to default operation on the next `$EST` step: `ISCALE_MIN=-100 ISCALE_MAX=-100`. The default is `ISCALE_MIN=1.0E-06 ISCALE_MAX=1.0E+06` for MCMC sampling.
13. `NOCOV`: If covariance estimation is not desired for a particular estimation step, set `NOCOV=1`. Normally, after SAEM you perform a refinement step (Section 3.0.2.5), so `NOCOV=1` can be set during SAEM and changed back to `NOCOV=0` for the IMP refinement.
14. `DERCONT`: Derivative continuity. Default 0. `DERCONT=1` might be helpful in cases of extreme discontinuity, when varying certain THETAs by even just a small amount results in a large change in OFV.
15. `CONSTRAIN`: A built-in simulated annealing algorithm. It starts the OMEGAs at 1.5 times the initial values, and then controls the rate at which the OMEGAs shrink during each iteration. `CONSTRAIN=7` performs simulated annealing on both OMEGA (zero and non-zero) and SIGMA parameters. To mimic Monolix, set OMEGA and additive SIGMA initial values to 0.66, and proportional SIGMA to 0.22. Default 1.
AUTO=1

```
$ESTIMATION METHOD=SAEM AUTO=1
```

is equivalent to:

```
$ESTIMATION METHOD=SAEM INTERACTION
            CTYPE=3 NITER=1000 NBURN=4000 ISAMPEND=10
            NOPRIOR=1 CITER=10 CINTERVAL=0 CALPHA=0.05
            IACCEPT=0.4 ISCALE_MIN=1.0E-06 ISCALE_MAX=1.0E+06
            ISAMPLE_M1=2 ISAMPLE_M1A=0 ISAMPLE_M1B=2
            ISAMPLE_M2=2 ISAMPLE_M3=2
            CONSTRAIN=1 EONLY=0 ISAMPLE=2 MCETA=0
```
AUTO=2

`AUTO=2` is the same as `AUTO=1`, with additional options.
Initial estimates
It is recommended to set initial variability estimates high in order for SAEM to have the possibility to explore the domain.
Without annealing:

```
$OMEGA 1.0
$SIGMA 1.0 ; additive
$SIGMA 0.3 ; proportional
```

With annealing:

```
$OMEGA 0.66
$SIGMA 0.66 ; additive
$SIGMA 0.22 ; proportional
```
By default, no MAP estimation is performed with SAEM (`MAPITERS=0`). To get good individual parameter values near the mode of the posterior density for the first iteration of SAEM, you can set `MAPITERS=1`.
The EM-methods are already quite robust in that they can handle many different estimation problems.
Use full OMEGA blocks
Full OMEGA blocks are generally preferred in EM-based estimation, as EM handles them more reliably than FOCE [4]. Including off-diagonal elements does not compromise stability; on the contrary, off-diagonals allow the algorithms to search the parameter space more efficiently. Off-diagonals should therefore not be fixed to 0 just because of high RSEs (>100%) (as you would with FOCE). If they were fixed in a previous FOCE run due to rounding errors or failed `$COVARIANCE` steps, they should be estimated in EM. Only consider fixing off-diagonals back to 0 if the EM OFV is highly variable, or if the covariance matrix is non-positive-definite.
For uncertain initial estimates
- The PsN program parallel_retries can be useful.
- METHOD=CHAIN: creates random initial values of THETAs and OMEGAs. Applies only to the estimation step.
- $CHAIN: introduces initial THETAs, OMEGAs, and SIGMAs that cover the entire scope of a given problem.
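A minimal sketch of the chain approach; the file name and option values below are illustrative assumptions:

```
; Run 1: write 10 randomized initial-estimate sets to a file
$ESTIMATION METHOD=CHAIN FILE=run1chain.txt NSAMPLE=10 ISAMPLE=0

; Run 2: read set number 3 from that file, then estimate from it
$ESTIMATION METHOD=CHAIN FILE=run1chain.txt NSAMPLE=0 ISAMPLE=3
$ESTIMATION METHOD=SAEM INTERACTION NBURN=2000 NITER=500
```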
Monte Carlo Direct Sampling can help IMP if the first iteration fails due to MAP issues. Here a few iterations of DIRECT define the sampling distribution for IMP:
$ESTIMATION METHOD=DIRECT INTERACTION ISAMPLE=10000 NITER=3
$ESTIMATION METHOD=IMP INTERACTION ISAMPLE=1000 NITER=50 MAPITER=0

MAPITER=0 tells IMP to skip its own MAP search and instead use the sampling distribution that DIRECT just estimated.
If using SAEM: MCETA
With SAEM: If the option MCETA
is set to greater than ISAMPLE_M1
, then for the first iteration of SAEM, ISAMPLE_M1
, ISAMPLE_M2
, and ISAMPLE_M3
will be set to MCETA
, to facilitate a robust initial search for reasonable parameter values.
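With that in mind, a robust SAEM start using MCETA could be sketched as follows (the value 100 and the NBURN/NITER values are illustrative assumptions):

```
$ESTIMATION METHOD=SAEM INTERACTION NBURN=2000 NITER=500 MCETA=100
```

Since MCETA=100 exceeds the default ISAMPLE_M1=2, the first SAEM iteration will use 100 samples in each of its M1–M3 sampling steps.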
If using SAEM: $ANNEAL
If you run SAEM and suspect that your initial values might be far from the solution, you could consider simulated annealing. By default, NONMEM’s gradient-based methods behave nicely when you start close to the solution. When you don’t, annealing is more suitable. Here’s the intuition:
- Pick a generous starting value for OMEGAs (e.g., 1).
- Each iteration, those values shrink.
- Eventually the OMEGAs will reach 0 (as intended).
Use $ANNEAL
to set the initial diagonal OMEGA values, and $ESTIMATION CONSTRAIN=n
to say what gets annealed.
For example, $ANNEAL 1-3,5:0.3 6,7:1.0 will set initial values of OMEGA(1,1), OMEGA(2,2), OMEGA(3,3), and OMEGA(5,5) to 0.3, while initial OMEGA(6,6) and OMEGA(7,7) are set to 1.0.
Values for CONSTRAIN=n
are:
- 0 or 4: No annealing
- 1 or 5: Anneal OMEGA. Default is 1.
- 2 or 6: Anneal SIGMA
- 3 or 7: Anneal OMEGA & SIGMA
When CONSTRAIN
≥ 4, annealing is also applied to diagonal OMEGAs fixed to 0.
You can even code a custom annealing algorithm (see below)! For example, one use would be to slow the rate of reduction of the diagonal elements of the OMEGA values during the burn-in phase of the SAEM method (as is done in Monolix). This algorithm was derived from a template of CONSTRAINT.f90
in the ..\source
directory.
Show FORTRAN code for custom algorithm
File custom_anneal.f90
SUBROUTINE CONSTRAINT(THETAS, NTHETAS, SIGMA2, NSIGMAS, OMEGA, NOMEGAS, ITER_NO)
  USE SIZES, ONLY: ISIZE, DPSIZE
  IMPLICIT NONE
  INCLUDE '../nm/TOTAL.INC'
  INTEGER(KIND = ISIZE) NTHETAS, NSIGMAS, NOMEGAS, ITER_NO
  INTEGER :: I, J, ITER_OLD
  DATA ITER_OLD /-1/
  REAL(KIND = DPSIZE) :: OMEGA(MAXOMEG,MAXOMEG), THETAS(MAXPTHETA), SIGMA2(MAXPTHETA)
  REAL(KIND = DPSIZE) :: OMEGO(MAXOMEG)
  SAVE
  !---------------------------------------------------------------------------
  IF (SAEM_MODE == 1 .AND. IMP_MODE == 0 .AND. ITS_MODE == 0 .AND. ITER_NO < 200) THEN
    IF (ITER_NO /= ITER_OLD .OR. ITER_NO == 0) THEN
      ! During the burn-in phase of SAEM, when a new iteration occurs
      ! (ITER_OLD /= ITER_NO), store the present diagonals of OMEGA
      ITER_OLD = ITER_NO
      DO I = 1, NOMEGAS
        OMEGO(I) = OMEGA(I,I)
      ENDDO
    ENDIF
    IF (ITER_NO /= 0) THEN
      DO I = 1, NOMEGAS
        ! Monolix-style algorithm to "slow down" the reduction of OMEGA:
        ! each diagonal may shrink by at most 5% per iteration
        OMEGA(I,I) = MAX(OMEGA(I,I), 0.95D0 * OMEGO(I))
      ENDDO
    ENDIF
  ENDIF
  RETURN
  !---------------------------------------------------------------------------
END SUBROUTINE CONSTRAINT
$SUBROUTINES OTHER=custom_anneal.f90 ; custom annealing implemented via OTHER

$PK
MU_1 = THETA(1)
EMAX = EXP(MU_1 + ETA(1))
MU_2 = THETA(2)
ED50 = EXP(MU_2 + ETA(2))
IPRED = EMAX * DOSE / (ED50 + DOSE)

$THETA 4.1 ; 1. EMAX
$THETA 6.9 ; 2. ED50
$OMEGA 0 FIX
$OMEGA 0.1
$ANNEAL 1:0.3
$ESTIMATION METHOD=SAEM INTERACTION CONSTRAIN=5
$ESTIMATION METHOD=IMP INTERACTION EONLY=1 CONSTRAIN=0 ; turn annealing off for the refinement step
THETAs with no ETAs
One alternative is to use ETAs anyway but fix the OMEGA to a low value such as 0.0025, which corresponds to roughly 5% CV for log-normally distributed parameters. With this method, the EM algorithm can be used, and the variability is kept small.
Another option would be to implement $ANNEAL
with the CONSTRAIN
option set to ≥4 in $ESTIMATION
.
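A minimal sketch of the first alternative, assuming THETA(3) is a parameter without true IIV:

```
$PK
MU_3 = THETA(3)
KA   = EXP(MU_3 + ETA(3)) ; ETA(3) exists only so that THETA(3) can be MU-modeled

$OMEGA 0.0025 FIX ; ~5% CV for a log-normal parameter; variability kept small
```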
Convergence test
As stated before, you need to set up a criterion for when the EM method should stop (aka “converge” or “terminate”). The convergence test sets up a linear regression over recent iterations and tests whether the slope is not statistically different from zero.
Show convergence test options
- CTYPE: Once the OFV is stationary, the analysis has often converged statistically, meaning CTYPE=1 is enough.
- CINTERVAL=n: Integer. Every n-th iteration is saved for the convergence test (see CITER). If CINTERVAL=0, then NONMEM will decide10.
- CITER=n: Integer. Number of latest saved iterations (see CINTERVAL) on which to perform the regression. CITER=10 is the default. Also coded as CNSAMP.
- CALPHA=x: Real, 0.01–0.05. Alpha error rate for the linear regression test of statistical significance. The default value is 0.05.
IMP: IMP iterations tend to be statistically uncorrelated, and so it is reasonable to have CITER=10
consecutive iterations (CINTERVAL=1
) tested at CALPHA=0.05
.
SAEM: Iterations can be highly correlated. To properly detect a lack of change in parameters, you may want to test every 10–100th iteration (CINTERVAL
=10 to 100), so that the linear regression on parameter change is spread out over a larger segment of iterations. To mimic Monolix, set CTYPE=3
, CINTERVAL=15
, and CITER=10
.
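Put together, a Monolix-like SAEM convergence test could look like this (the NBURN/NITER values are illustrative placeholders):

```
$ESTIMATION METHOD=SAEM INTERACTION NBURN=4000 NITER=1000
            CTYPE=3 CINTERVAL=15 CITER=10 CALPHA=0.05
```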
Give a rather high value to NBURN
(1000–10000 for SAEM/BAYES) or NITER
(200–1000 for ITS/IMP/IMPMAP) so you don’t run out of iterations before the method converges.
An alternative method to convergence testing is to set NBURN
to a very high number (10000), monitor the change in SAEMOBJ
, and enter ctrl-K (see section I.15 Interactive Control of a NONMEM batch Program) when you feel that the variations are stationary, which will end the burn-in mode and continue on to the statistical/accumulation mode.
It is better to provide a large NBURN
number, and end it at will with ctrl-K, or allow the convergence tester to end it, rather than to have a small NBURN
number and have the burn-in phase end prematurely.
Refinement step
When using EM methods, it is advised to include a “refinement step” in the form of an additional subsequent $ESTIMATION METHOD=IMP EONLY=1
record to obtain more precise final estimates. This will take the population parameter estimates and calculate a more precise objective function value (just the E-step).
Show refinement step example
1. The higher number of samples (3000–10000) should give a more stable result (although the run time of each iteration will increase significantly).
2. The number of iterations should be low, 5–10. IMP EONLY=1 needs a few iterations to allow the proposal density to become stationary. This is observed as the OFV stabilizing (oscillating) around a particular value. Assessing the mean and variance of the iterated OFV will give a more accurate estimate of the final OFV11.
3. Perform only the E-step (do not update population parameters).
4. To maintain continuity of the estimation from the previous step.
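A sketch consistent with the annotations above (option values are illustrative, and using MAPITER=0 as the continuity option is an assumption):

```
$ESTIMATION METHOD=SAEM INTERACTION NBURN=2000 NITER=500 ; main estimation
$ESTIMATION METHOD=IMP INTERACTION                       ; refinement step
            ISAMPLE=3000 ; (1) more samples for a stable OFV
            NITER=10     ; (2) few iterations, until the proposal density is stationary
            EONLY=1      ; (3) E-step only; population parameters are not updated
            MAPITER=0    ; (4) assumed: reuse the previous step's results for continuity
```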
Random number generator (RNG)
NONMEM’s internal RNG is controlled with RANMETHOD=[n|M|S|m|P]
.
The “Knuth” RNG is the default in $ESTIMATION
, specified as RANMETHOD=3
(n=3
). To reduce sampling noise, use the Sobol method (S
, RANMETHOD=3S
) with Faure-Tezuka type scrambling (m=2
, RANMETHOD=3S2
). That added stability often lets you reach reliable results with IMP using a smaller ISAMPLE.
When RANMETHOD=S
is combined with a t-distribution (DF>0), DF
should be limited to 1, 2, 4, 5, 8, or 10. However, discretizing the d.f. does not work if you use IACCEPT=0.0
to automatically determine the DF
.
In this case, use IACCEPTL
as described for IMP instead of DF
.
Running NONMEM in parallel? Append P
so every subject keeps its own seed path, independent of the number of processes: RANMETHOD=P
.
So, the final recommended RANMETHOD
for IMP becomes: RANMETHOD=3S2P
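Combining the pieces, an IMP record with the recommended RNG settings might be sketched as follows (the ISAMPLE/NITER values are illustrative):

```
$ESTIMATION METHOD=IMP INTERACTION ISAMPLE=1000 NITER=50 RANMETHOD=3S2P
```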
Seed for RNG
Using the SEED
option is a requirement for reproducible results. You can use any SEED number; the default is SEED=14455
. It’s also advisable to keep the same SEED throughout your analysis, i.e., in all the models, so that differences in model fit reflect the models rather than stochastic happenstance.
Parallel NONMEM
For stochastic methods (MCETA, DIRECT, IMP, IMPMAP, and SAEM), the distribution of subjects to different nodes for computation depends on the number of processors. This leads to differences in the sequences of random numbers generated during computation, which can make the results dependent on the number of processors.
Moreover, with the parallel processing option PARSE_TYPE=4
, the results of the stochastic methods may not be exactly reproducible even when repeated on the same number of processors, because the distribution of the subjects to each processor may depend on the computer load. The option PARSE_TYPE=1
(equal number of subjects at each node) guarantees the results to be independent of the computer load. Still, the results will depend on the number of used processors.
Copy this file: /opt/local64/nonmem/nm_7.4.4_g510/run/mpilinux8.pnm
into my_own_parafile.pnm
and put it somewhere reasonable. Then change PARSE_TYPE
to 1:
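The relevant record in the copied parafile might then look roughly like this (a sketch based on the $GENERAL record found in NONMEM’s template .pnm files; the surrounding options are left as in the template):

```
$GENERAL
NODES=[nodes] PARSE_TYPE=1 PARSE_NUM=200 TIMEOUTI=600 TIMEOUT=10000 PARAPRINT=0 TRANSFER_TYPE=1
```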
When running models, use your own parafile: -parafile=my_own_parafile.pnm
- MAP estimation is always parallelized properly.
- Always use RANMETHOD=P.
- Reducing ATOL in $SUBROUTINES (usually set equal to TOL) may improve speed.
- Keeping ISAMPLE low makes for faster iterations; however, it will result in higher stochastic noise.
- MU-model as many THETAs as possible. Even if a parameter doesn’t have IIV, MU-model it anyway. Sometimes it is impossible to MU-model a THETA; if you can, reparameterize so that it can be MU-modeled, or keep a maximum of three non-MU-modeled THETAs.
When to use which
If you are unsure which method to use, start with MLE methods like FOCE or LAPLACE, then turn to EM.
The reason is simple: with MLE there is one correct answer, a single best set of parameters for a dataset given a statistical model, whereas EM methods can yield several answers (i.e., parameter estimates). This makes for better reproducibility of the model fit and less ambiguity about the “best fit”.
With that said, Table 2 provides some rules of thumb for when to consider which estimation method.
| Scenario | FOCE | LAPLACE | IMP | IMPMAP | SAEM |
|---|---|---|---|---|---|
| Very sparse data | | | | | x |
| Sparse data | | x | x | x | x |
| Rich data | x | x | x | x | x |
| Very rich data | x | x | x | | |
| Categorical data | | x | | | x |
| Simple models | x | x | | | |
| Complex models | | | x | x | x |
For simple one- and two-compartment PK models, FOCE is the preferred method, whereas more complex models and/or very sparse datasets typically benefit from EM methods [5–7]. In general, EM tends to perform better than MLE in the following situations:
Relatively complex models, such as:
- Parent-metabolite
- Non-linear clearance
- Target-mediated drug disposition (TMDD) [1]
- Mixture models ($MIX)
- Many covariates (like during FFEM)
- Multiple PD endpoints or sequential PD biomarkers
- Categorical endpoints
- Count data
- Binary data (consider SAEM)
Data limitations, such as:
- Sparse or incomplete datasets
- Lack of observations across the full dose–response range
- Insufficient PK data to identify all compartments
Think about “sparse data” as missing important parts of the picture. Perhaps we miss the peak concentration or don’t have high enough doses to see the top of the dose-response curve. In that situation, classic MLE can have a tough time because the likelihood surface is poorly defined.
That’s where EM shines ✨. EM fills in the gaps of the missing data, so the parameter estimates stay much more stable than with MLE.
However, EM is not a magic fix. If you’re forcing the wrong model onto the data, say, fitting a two-compartment PK model when the observations clearly behave as one compartment, the problem is your model, not the algorithm.
That’s different from using an EM algorithm on a complex model where you know that you have PKPD response, but your dose or exposure range is limited.
OFVs between MLE and EM methods are incomparable.
Issuing Multiple Estimations within a Single Problem
A sequence of two or more $ESTIMATION
statements within a given problem will result in the sequential execution of separate estimations. For example:
$THETA 0.3 0.5 6.0
$OMEGA 0.2 0.2 0.2
$SIGMA 0.2
$ESTIMATION METHOD=0 MAXEVAL=9999 NSIG=3 ; first estimation step
$ESTIMATION METHOD=CONDITIONAL NSIG=4    ; second estimation step
NONMEM will first estimate the problem by the first-order method, using as initial parameters those defined by the $THETA, $OMEGA, and $SIGMA statements. Next, the first-order conditional estimation method will be run, using as initial parameters the final estimates of THETA, OMEGA, and SIGMA from the previous analysis.
Up to 20 estimations may be performed within a problem.
References
Footnotes
1. “ML” methods (Machine Learning) are something else entirely.
2. In pharmacometrics, also called “Mode a Posteriori”. Essentially the same, since the mode is the maximum of a unimodal distribution.
3. An alternative method is to use the $TOL record, where options for each compartment may be entered.
4. Credit: Adrian Dunne
5. Credit: Robert Bauer
6. Credit: Robert Bauer
7. Credit: Robert Bauer
8. Complex PK/PD problems are those that require ODEs, or problems with more than six model parameters to be estimated.
9. Very sparse or categorical data.
10. I frequently use this.
11. Credit: Jon Moss