Landscape change analysis with MOLUSCE - methods and algorithms

Terminology used in this document

MOLUSCE is a QGIS plugin for Land Use Change Evaluation. MOLUSCE stands for Methods Of Land Use Chande Evaluation. Briefly, tThe plugin performs the following process:

Takes raster of land use categories for the past time, raster of land use categories for the present time and rasters of explanatory variables or factors.
Trains a model that predicts land use changes.
Predicts future land use changes via model, current state of land use and current factors.

Model is an algorithm that is used for prediction of land use changes (transitions).
State raster is one-band raster that contains landuse categories encoded in the raster's pixels.
Input state raster is one-band raster that contains landuse categories of the past. It is an input for used model.
Factor rasters are rasters of explanatory variables. The rasters can be one-band or multy-band. These are inputs for used model.
Output state raster is one-band raster that contains landuse categories of the present. It is the desired result of prediction for model (a user trains the model to predict (input raster, factors) -> output raster).
Transition is a information about land use change. Every change (for example forest -> non-forest) can be viewed as a transition of land use categories (encoded as pixels) from Input state raster to Output state raster.
Change map is a one-band raster that stores information about transitions.

General structure of the plugin

MOLUSCE consists of several parts. The most important are:

GUI modules - implement user interface
Utility modules
- Data Provider - provides procedures for reading/writing raster data and similar utility functions
- Cross Tabulation - provides functions for creating contingency tables
- Sampler - provides sampling procedure
Algorithmic modules:
- Area Analysis - provides procedures of change searching, making change maps
- Modeling - provides submodules for modeling relation between input-output data
- Simulation - provides procedure of land change simulation
- Validation - provides statistic functions and procedures for validation of simulation result

This manual gives general overview of the modules. The description is valid for structure of plugin version <= 1.x.x

Utility modules

Utility modules are used in many other modules of the plugin. The most important utility modules such as DataProvider and Sampler are discussed in this section.

Data Provider

This module provides data structure for internal storing of raster data. It uses numpy masked arrays as data store. Data Provider provides methods for:

Creation and storing rasters: reading data from file and manage no-data values, creating new rasters, saving data.
Access to the data and data manipulation: Read particular raster band, replace a band by new array of pixels, get statistics of raster's bands, normalize the data for particular algorithm, reads geometry and geography related information from the raster variable, etc.
Comparing geodata objects: check raster sizes, projections and geo transform objects between rasters.

Sampler

Sampler is the module that performs sampling procedure. A sample is a set of input data and corresponding output data that has to be predicted via a model.

Sampler works with single pixels and a neighborhood of pixels. This is means that a user can take into account a set of pixels around current pixel. The implementation of the algorithm uses square neighborhood. For example if the user set up 1-size neighborhood then sampling procedure reads the current pixel and all pixels located immediate vicinity (3x3 window size). If the user set up 2-size neighborhood then sampling procedure reads the current pixel and all pixels located in 5x5 window size and so on. Every neighborhood is stored as vector (9 elements, 25 elements and so on). That sampling is performed for every input raster (initial state raster and factors).

Therefore, a sample contains:

coordinates of the pixel,
input data (consists of 2 parts):
- state pixel vector is the neighborhood that is read from 1-band raster, this raster contains initial states (categories). The raster is divided into set of dummy variables before sampling (see below).
- factor pixel vectors are the neighborhoods extracted from factor rasters (they can be multiband).
output data is read from 1-band raster, this raster contains final states.

There are several sampling modes:

"All". This mode allows store all pixels of the input and output rasters. The mode of sampling needs big amount of RAM and CPU time if input rasters are big.
"Random". This mode allows store random selected samples: if output raster (target raster) has N pixels, then the every pixel has probability of selection 1/N.
"Stratified". This mode allows undersampling of major categories and/or oversampling of minor categories. If C is desired count of samples and K is number of output categories, then the sampling procedure select K random groups of pixels of equal size C/K. Every group will contain pixels of one certain category.

Cross Tabulation

Cross Tabulation is a module that is used in several places of the workflow. The main purpose of the module is creating of contingency tables. For example it is used during creating of transition matrix. In this case the method calculates transition statistic -- how many pixels has been changed. The result is the table:

	Category 1	Category 2	...	Category N
Category 1	Count of unchanged pixels of Category 1	Count of transition Category 1 - > Category 2	...	Count of transition Category 1 - > Category N
Category 2	Count of transition Category 2 - > Category 1	Count of unchanged pixels of Category 2	...	Count of transition Category 2 - > Category N
...	...	...	...	...
Category N	Count of transition Category N - > Category 1	Count of transition Category N - > Category 2	...	Count of unchanged pixels of Category N

Algorithmic modules

Area Analysis

The main purpose of area analysis module is change map calculation. The module uses the next scheme of transition encoding:

	Category 1	Category 2	...	Category N
Category 1	0	1	...	N -1
Category 2	N	N + 1	...	2N -1
...	...	...	...	...
Category N	(N-1)*N	(N-1)*N + 1	...	N*N -1

For example, if in the state rasters category "Forest" coded as "3", category "Non-forest" coded as "2", then transition "forest" -> "non-forest" in the change map will be coded as "7".

For example, for 4 classes (N=4), there are 4*4=16 possible transitions:

	Agriculture	Forest	Urban	Water
Agriculture	0	1	2	3
Forest	4	5	6	7
Urban	8	9	10	11
Water	12	13	14	15

Output of Area Analysis stage is a raster where each pixel is assigned an integer value indicating its transition.

Modeling

A user has a number of factors, init state raster and final state raster. The goal of modeling stage is to create a model that can predict land use changes. Before to discus partial predicting methods we need to explain the common preparing stage.

The first subsection of this section discuses types of input and output data and necessary manipulations with the data, the next subsections describe concrete realizations of predicting models.

Data types and initialization procedures

The next assumptions was used during developing:

Initial and final state rasters are categorical one-band rasters, therefore a change map is categorical one-band raster too.
Factor rasters are ordinal or continuous rasters (one-band or multiband), but not categoraical.

A user fits model to predict land use changes, therefore the user has initial state raster and factors as input data and the change map as target variable (initial states, factors -> change map).

Initialization of sampling procedure

Logistic regression and Multilayer perceptron predictors uses sampling procedure. This procedure collects samples that are used during fitting of the model. But Sampler collects transformed data, because usually the transformation allows achieve better fitting. MOLUSCE uses two transformations:

Dummy coding for categoracal variables.
Normalization for continuous and ordered variables.

Dummy coding

Because the initial state raster and change map are categorical data, and each of them must be transformed to set of independent variables. To do that the plugin uses dummy coding when a categorical variable of N levels is divided into N-1 dummy variables. For example, for 4 classes land use map (Agriculture, Forest, Urban, Water), we have to use 3 dummy variables:

	Dummy 1	Dummy 2	Dummy 3
Agriculture	0	0	1
Forest	0	1	0
Urban	1	0	0
Water	0	0	0

One row (in this case Water) consists of zeros. It is important because otherwise the dummy variables became linear-dependent. The condition of linear independence of data is crucial for Logistic regression and strongly recommended for Multilayer Percepton.

Normalization

The next stage of initialization procedure is normalization of factor data. Usually normalization allows to achieve more efficient training and more accurate prediction result. Plugin uses linear normalization:

X_{n}={\frac {X-m_{x}}{\sigma _{x}}}

where X is a variable, $X_{n}$ is normalized variable, $m_{x}$ is mean of X and ${\sigma _{x}}$ is standard deviation of X.

Multilayer perceptron

The plugin uses classic realization of Multilayer perceptron. The input data is a collection of pixels of initial state raster and factor rasters. The target output data is the change map.

So, the perceptron model performs the next actions:

Initial preprocessing of the data (dummy coding and normalization).
Sampling.
Training.

The initial stages are described (in general) in the previous section, this subsection discussed the structure of the neural net, the training procedure and particular features of initialization.

MOLUSCE uses multilayer percepron with numpy.tanh sigmoid function. Therefore target variables (the change map categories) should be scaled to (-1, 1) interval during dummy coding instead (0, 1). A user can set arbitrary number of hidden layers (one or more) and arbitrary numbers (one or more) of neurons in the layers.

The created net has

$(C-1)(2N+1)^{2}+B(2N+1)^{2}$ input neurons;
$M$ output neurons (it depends on samplig mode);

Where C is the count of land use categories, N is the neighborhood size specified by the user, B is the summary band count of factor rasters (for example if two factors are user one-band and three-band, then B=4). M usually is a count of uniques categories in the change map ( $C^{2}$ ). But the count of output neurons can be less then M, it depends on the sampling mode. If mode "Random" is used, some small classes could be omitted during sampling procedure.

The module uses classic backpropagation algorithm with momentum for the learning procedure. Weights correction are performed as

w(n+1)=r*dw(n)+m*dw(n-1)

where w is a vector of neuron weights, dw is a vector of weights changes, n is an iteration number, r is learning rate, m is momentum.

Training set is divided in two parts learning set (80% of samples by defaults) and validation set (20% of samples).

The module uses on-line learning with stochastic: a random sample is selected from the learning set, the weights of the net are updated during forward/backward propagation.

A error of fitting (for a sample) is the average square error of partial outputs of the net:

E={\frac {t_{i}-o_{i}}{d}}

where E is a sample error, $t_{i}$ is the target value of a output neuron for given sample, $o_{i}$ is the real output value of the neuron, d is the count of output neurons.

The module returns the next information about learning:

Graph of the average square errors. The errors (mean of all E on learning set and validation set) are calculated on learning and validation sets after every epoch and plotted on the graph.
Min validation overall error. It is average square error E that is achieved on all validation set (mean of all E on validation set). This is the best currently achieved result. The weights of the net are stored and after training the best weights will be returned as the best net.
Delta overall accuracy. It is the difference between min validation error and current validation error.
Current validation kappa. It is the kappa statistic achieved on the validation set.

Logistic regression

Logistic regression model behavior is very similar to the ANN. The input data is a collection of pixels of initial state raster and factor rasters. The target output data is the change map.

So, the logistic regression model also performs the next actions:

Initial preprocessing of the data (dummy coding and normalization).
Sampling.
Training.

The created model has $M((C-1)(2N+1)^{2}+B(2N+1)^{2}+1)$ coefficients, but only $(M-1)((C-1)(2N+1)^{2}+B(2N+1)^{2})$ are independent. As before (see perceptron discussion), C is the count of land use categories, N is neighborhood size specified by the user, B is summary band count of factor rasters (for example if two factors are user one-band and three-band, then B=4). M usually is count of uniques categories in the change map ( $C^{2}$ ). M depends on the sampling mode also.

For example, discuss coefficients of one transition class (transition from land use category 2 to land use category 3, for example). If we have four land use categories 1,2,3,4 and we select one one-band factor and we use 0-neighborhood, we have 5 coefficients per transition class:

$b_{0}$ is an intercept;
$b_{1},b_{2},b_{3}$ are the coefficients of dummy variables of current landuse state (categories of the raster are coded as: 1 = (0, 0, 1); 2 = (0, 1, 0); 3 = (1, 0, 0); 4 = (0, 0, 0), see Sampler discussion);
$b_{4}$ is the coefficient of factor pixel (0-neighborhood).

If we select two one-band factors, we have 6 coefficients per transition class:

$b_{0}$ is an intercept;
$b_{1},b_{2},b_{3}$ are the coefficients of dummy variables of current landuse state;
$b_{4}$ is the coefficient of the first factor (0-neighborhood).
$b_{5}$ is the coefficient of the second factor (0-neighborhood).

if we use 1-neighborhood and two one-band factors, we have 37 coefficients per transition class:

$b_{0}$ is an intercept;
coefficients from $b_{1}$ thru $b_{27}$ are the coefficients of dummy variables of current landuse state: ( $b_{1}$ , ..., $b_{3}$ are dummy variables for the upper-left corner of 3x3 neighborhood window);
coefficients from $b_{28}$ thru $b_{36}$ are the coefficients of the first raster pixels (1-neighborhood uses 9 pixels or 3x3 window);
coefficients from $b_{37}$ thru $b_{45}$ are the coefficients of the second raster pixels.

Weight of evidence

Рассказать, про то, как бьются непрерывные растры на подклассы и почему. Рассказать, что происходит, если растр распознан как категориальный.

Multi criteria evaluation

Simulation

Validation

Special topics

Memory Usage

Sampler uses approximately $N*(8*b*c*w+3)$ bytes of memory

Landscape change analysis with MOLUSCE - methods and algorithms

Содержание

Terminology used in this document

General structure of the plugin