A Julia interface to Stan's cmdstan executable
Stan.jl v9
Stan is a system for statistical modeling, data analysis, and prediction. It is extensively used in social, biological, and physical sciences, engineering, and business. The Stan language and the interfaces to execute a Stan language program are documented here.
Stan.jl illustrates how the packages available in StanJulia's ecosystem wrap the methods available in Stan's cmdstan executable.
Stan.jl v9 uses StanSample.jl v6, StanOptimize.jl v4, StanQuap.jl v4, StanDiagnose.jl v4 and StanVariational v4 and supports multithreading on C++ level. Stan.jl v9 also uses JSON.jl to generate data and init input files for cmdstan.
The StanJulia ecosystem includes 2 additional packages, StanQuap.jl (to compute MAP estimates) and DiffEqBayesStan.jl.
Options for multi-threading and multi-chaining
Stan.jl v9 is intended to use Stan's cmdstan
v2.35.0+ and StanSample.jl v6.
StanSample.jl v6 enables the use of c++ multithreading in the cmdstan
binary. To activate multithreading in cmdstan
this needs to be specified during the build process of cmdstan
. I typically create a path_to_cmdstan_directory/make/local
file (before running make -j9 build
) containing STAN_THREADS=true
. For an example, see the .github/CI.yml
script
This means StanSample supports 2 mechanisms for in parallel drawing samples for chains, i.e. on C++ level (using C++ threads) and on Julia level (by spawing a Julia process for each chain).
The use_cpp_chains
keyword argument for stan_sample()
determines if chains are executed on C++ level or on Julia level. By default, use_cpp_chains=false
.
By default in ether case num_chains=4
. See ??stan_sample
. Internally, num_chains
will be copied to either num_cpp_chains
or num_julia_chains'.
Note: Currently I do not suggest to use both C++ level chains and Julia level chains. Based on use_cpp_chains
the stan_sample()
method will set either num_cpp_chains=num_chains; num_julia_chains=1
or num_julia_chains=num_chains;num_cpp_chain=1
(the default of use_cpp_chains
is false).
Set the check_num_chains
keyword argument in the call to stan_sample()
to false
to prevent above default behavior. See the example in the Examples/RedCardsStudy
directory for more details and an example.
Threads on C++ level can be used in multiple ways, e.g. to run separate chains and to speed up certain operations.
StanSample.jl's SampleModel sets the C++ num_threads
to 4 but for compatibility with previous versions of StanJulia this is by default (use_cpp_chains=false
) not included in the generated command line, e.g. see sm.cmds
where sm
is your SampleModel.
An example of the possible performance trade-offs between use_cpp_threads
, num_cpp_chains
and num_julia_chains
can be found in the this directory.
StanJulia overview
Stan.jl is part of the StanJulia Github organization set of packages.
The use of the underlying method packages in StanJulia, i.e. StanSample.jl (the primary workhorse package), StanOptimize.jl, StanVariational.jl, StanQuap.jl and DiffEqBayesStan.jl are demonstrated in Stan.jl and in a much broader context in StatisticalRethinking.jl.
Stan.jl is not the only Stan mcmc option in Julia. Other options are PyCall.jl/PyStan and StanRun.jl. In addition, Julia provides other, pure Julia, mcmc options such as DynamicHMC.jl, Turing.jl and Mamba.jl.
On a high level, a typical workflow to use Stan.jl looks like:
using Stan
# Define a Stan language program.
bernoulli = "..."
# Create and compile a SampleModel, an OptimizeModel, etc.:
sm = SampleModel(...)
# Run the compiled Stan languauge program and collect draws:
rc = stan_sample(...)
if success(rc)
# Retrieve Stan's `stansummary` executable result:
sdf = read_summary(sm)
# Display the summary as a DataFrame:
sdf |> display
# Extract the draws from the SampleModel and show the schema:
tbl = read_samples(sm, :table)
tbl |> display
# Or converted to a DataFrame
DataFrame(tbl) |> display
# See below for reading in the draws directly into a DataFrame.
end
Above workflow returns a StanTable.Table object with all chains appended.
If e.g. a DataFrame (with all chains appended) is preferred:
df = read_samples(sm, :dataframe)
Other options are :namedtuple
, :particles
, :keyedarray
, :dimdata
, :mcmcchains
, etc. See ?read_samples
for more details. Walkthrough and Walkthrough2 show StanSample.jl in action.
References
There is no shortage of good books on Bayesian statistics. A few of my favorites are:
Gelman, Hill: Data Analysis using regression and multileve,/hierachical models
Betancourt: A Conceptual Introduction to Hamiltonian Monte Carlo
Special mention is appropriate for the new book:
which in a sense is a major update to item 3. above.