
## Pre-process data to better learn Bayesian networks

### Description

Screen and transform the data to make them more suitable for structure and parameter learning.

### Usage

    # discretize continuous data into factors.
    discretize(data, method, breaks = 3, ordered = FALSE, ..., debug = FALSE)
    # screen continuous data for highly correlated pairs of variables.
    dedup(data, threshold, debug = FALSE)

### Arguments

`data` |
a data frame containing numeric columns (for |

`threshold` |
a numeric value between zero and one, the absolute correlation used as a threshold in screening highly correlated pairs. |

`method` |
a character string, either |

`breaks` |
if |

`ordered` |
a boolean value. If |

`...` |
additional tuning parameters, see below. |

`debug` |
a boolean value. If |

### Details

`discretize()` takes a data frame of continuous variables as its first argument and returns a
second data frame of discrete variables, transformed using one of three methods: `interval`,
`quantile` or `hartemink`.
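As a rough illustration of how the `interval` and `quantile` methods differ, the sketch below uses base R's `cut()` and `quantile()` on a simulated vector. It is not bnlearn's implementation; the variable names are made up for the example, and it assumes that `interval` means equal-width bins and `quantile` means equal-frequency bins.

```r
# Illustrative only: base R stand-ins for equal-width and equal-frequency binning.
set.seed(42)
x <- rnorm(200)

# Equal-width bins: split the range of x into 3 intervals of the same length.
interval_bins <- cut(x, breaks = 3)

# Equal-frequency bins: cut at the empirical terciles, so that each bin
# holds roughly the same number of observations.
q <- quantile(x, probs = seq(0, 1, length.out = 4))
quantile_bins <- cut(x, breaks = q, include.lowest = TRUE)

# Compare how the observations are spread across the bins.
table(interval_bins)
table(quantile_bins)
```

With bell-shaped data the equal-width bins tend to put most observations in the middle interval, while the tercile-based bins are balanced by construction.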

`dedup()` screens the data for pairs of highly correlated variables, and discards one in each
pair.
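The screening idea can be sketched in base R as follows. The helper `screen_correlated()` is hypothetical and is not how bnlearn implements `dedup()`. In particular, it keeps the first variable of each correlated pair, which is an assumption made only for this example.

```r
# Hypothetical helper: drop one variable from each pair whose absolute
# correlation exceeds the threshold (assumption: the earlier column is kept).
screen_correlated <- function(data, threshold) {
  rho <- abs(cor(data))
  vars <- colnames(rho)
  dropped <- character(0)
  for (i in seq_along(vars)) {
    if (vars[i] %in% dropped) next
    for (j in seq_along(vars)) {
      if (j <= i || vars[j] %in% dropped) next
      if (rho[i, j] > threshold) dropped <- c(dropped, vars[j])
    }
  }
  data[, setdiff(vars, dropped), drop = FALSE]
}

# Toy data: b is a nearly exact copy of a, c is independent noise.
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.01), c = rnorm(100))
screened <- screen_correlated(toy, threshold = 0.9)
names(screened)
```

In this toy data set only `a` and `b` are strongly correlated, so `b` is discarded and the remaining columns are returned unchanged.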

### Value

`discretize()` returns a data frame with the same structure (number of columns, column names,
etc.) as `data`, containing the discretized variables.

`dedup()` returns a data frame with a subset of the columns of `data`.

### Note

Hartemink's algorithm has been designed to deal with sets of homogeneous, continuous variables; this is the
reason why they are initially transformed into discrete variables, all with the same number of levels (given by
the `ibreaks` argument). Which of the other algorithms is used for this initial step is specified by the
`idisc` argument (`quantile` is the default). The implementation in bnlearn also handles sets of discrete
variables with the same number of levels, which are treated as adjacent interval identifiers. This allows the
user to perform the initial discretization with the algorithm of their choice, as long as all variables have
the same number of levels in the end.

### Author(s)

Marco Scutari

### References

Hartemink A (2001). *Principled Computational Methods for the Validation and Discovery of Genetic
Regulatory Networks*. Ph.D. thesis, School of Electrical Engineering and Computer Science, Massachusetts
Institute of Technology.

### Examples

    data(gaussian.test)
    d = discretize(gaussian.test, method = 'hartemink', breaks = 4, ibreaks = 20)
    plot(hc(d))
    d2 = dedup(gaussian.test)
