sd4py.sd4py
sd4py is a package for performing subgroup discovery on tabular data. Call the sd4py.discover_subgroups() function on a pandas DataFrame and a collection of subgroups is returned.
This package provides a Python interface for using the Java application VIKAMINE.
Subgroup discovery finds patterns within some (explanatory) columns of data that help to explain another (target) column. The goal is to understand in what circumstances the target is extreme. With a numeric target, this means finding circumstances in which the value is exceptionally high (or exceptionally low) on average. For a non-numeric target, this means looking for circumstances in which a particular value is especially likely to occur. One of the key benefits of this approach is that the outputs are interpretable, being expressed as a readable combination of rules such as "'Temperature'=high AND 'Pressure'=low".
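To make "the target is extreme" concrete, here is a minimal sketch that evaluates one candidate subgroup by hand for a non-numeric target. The rows, the column names and the helpers share_positive and ps_quality are invented for illustration. The score uses the usual Piatetsky-Shapiro form (subgroup size times the difference between the subgroup share and the population share); it is an assumption about the general idea, not a copy of what VIKAMINE computes internally.

```python
# Illustrative data only: each dict is one row of a small made-up table.
rows = [
    {"Temperature": "high", "Pressure": "low", "Fault": True},
    {"Temperature": "high", "Pressure": "low", "Fault": True},
    {"Temperature": "high", "Pressure": "high", "Fault": False},
    {"Temperature": "low", "Pressure": "low", "Fault": False},
    {"Temperature": "low", "Pressure": "high", "Fault": False},
    {"Temperature": "low", "Pressure": "high", "Fault": True},
]


def share_positive(subset):
    """Fraction of rows in `subset` where the target column 'Fault' is True."""
    return sum(1 for r in subset if r["Fault"]) / len(subset)


def ps_quality(subgroup, population):
    """Hypothetical helper: size * (subgroup share - population share)."""
    return len(subgroup) * (share_positive(subgroup) - share_positive(population))


# The candidate pattern "'Temperature'=high AND 'Pressure'=low".
subgroup = [r for r in rows if r["Temperature"] == "high" and r["Pressure"] == "low"]

population_share = share_positive(rows)
subgroup_share = share_positive(subgroup)
quality = ps_quality(subgroup, rows)

print(len(subgroup), population_share, subgroup_share, quality)
```

A pattern scores well when it is both large and far from the population share, which is why a tiny subgroup with a perfect share can still rank below a larger, less extreme one.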
The package contains a discover_subgroups() function that finds subgroups based on a pandas DataFrame and a specified target column. The package also includes custom Python objects for holding the results.
"""
sd4py is a package that makes it easy to perform subgroup discovery on tabular data. It is extremely simple to use. Call the `sd4py.discover_subgroups()` function on a pandas dataframe and a collection of subgroups will be returned.

This package provides a Python interface for using the Java application VIKAMINE.

Subgroup discovery is based on finding patterns within some (explanatory) columns of data that then help to explain another (target) column of data.
The goal of the subgroup discovery process will be to understand in what circumstances the target is extreme. With a numeric target, this means finding circumstances in which the value is exceptionally high (or exceptionally low) on average.
For a non-numeric target, this means looking for circumstances when a particular value is especially likely to occur.
One of the key benefits of this approach is that the outputs are interpretable, being expressed as a readable combination of rules like (e.g.) "'Temperature'=high AND 'Pressure'=low".

The package contains a `discover_subgroups()` function that finds subgroups based on a pandas `DataFrame` and a specified target column. The package also includes custom python objects for holding the results.
"""
import pkg_resources

import jpype
import jpype.imports
from jpype.types import *

# Start the JVM with the bundled jars; resource_filename makes this work in a distributed package.
jpype.startJVM(
    classpath=[
        pkg_resources.resource_filename("sd4py", "vikamine_kernel.jar"),
        pkg_resources.resource_filename("sd4py", "sd4py.jar"),
    ]
)
import java.util.HashSet

from org.vikamine.kernel.subgroup.selectors import *
from org.sd4py.kernel import *

import pandas as pd
import numpy as np

import copy


class PyOntology:
    """
    Puts data into a Java `Ontology` object for use with the underlying Java subgroup discovery application.

    It is not necessary to use this class explicitly; pandas dataframes will automatically be converted into a `PyOntology` object when passed into `discover_subgroups()`.
    However, if the dataset is large, and subgroup discovery will be performed multiple times, then the user may opt to convert the dataset into a `PyOntology` to pass into `discover_subgroups()` for the sake of performance.

    Attributes
    --------------
    The only attribute of the class is `ontology`, created during initialisation, which is bound to an `Ontology` object in the Java runtime.
    """

    def __init__(self, df):
        # self.df = df.copy(deep=False)
        # self.index = df.index.copy(deep=False)
        self.column_names = df.columns.to_list()

        self.column_types = []
        self.datetime_columns = {}  # column name -> timezone of that column
        self.timedelta_columns = []

        numeric_arrays = []
        nominal_arrays = []

        for name, x in df.items():
            if (
                x.dtype == "object" or x.dtype == "bool" or x.dtype.name == "category"
            ):  # category depends on whether it's ordered?
                nominal_arrays.append(JArray(JString)(x.astype(str).values))
                self.column_types.append("nominal")

            elif np.issubdtype(x.dtype, np.datetime64):
                numeric_arrays.append(JArray(JDouble)(pd.to_numeric(x).values))
                self.column_types.append("numeric")
                self.datetime_columns[name] = x.dt.tz

            elif np.issubdtype(x.dtype, np.timedelta64):
                numeric_arrays.append(JArray(JDouble)(pd.to_numeric(x).values))
                self.column_types.append("numeric")
                self.timedelta_columns.append(name)

            elif np.issubdtype(x.dtype, np.number):
                if np.issubdtype(
                    x.dtype, "float16"
                ):  # there is no float16 interface between numpy and jpype, so it would raise an error to use .values
                    numeric_arrays.append(JArray(JDouble)(pd.to_numeric(x)))
                    self.column_types.append("numeric")

                else:
                    numeric_arrays.append(JArray(JDouble)(pd.to_numeric(x).values))
                    self.column_types.append("numeric")

            else:
                # Report the column name and dtype rather than the whole column.
                raise ValueError(
                    "Unrecognised pandas dtype for column {0}: {1}".format(name, x.dtype)
                )

        numeric_arrays = JArray(JArray(JDouble))(numeric_arrays)
        nominal_arrays = JArray(JArray(JString))(nominal_arrays)

        self.ontology = SD4PyOntologyCreator(
            JArray(JString)(self.column_names),
            JArray(JString)(self.column_types),
            numeric_arrays,
            nominal_arrays,
        ).ontology


class PyNumericSelector:
    """
    Represents a rule to select a subset of data, which combines with other selectors to form the subgroup/pattern definition.
    The relevant attribute name in the data is stored in `attribute`.
    This contains a numeric `lower_bound` and `upper_bound`, plus booleans `include_lower_bound` and `include_upper_bound` to decide whether border values are included in the selection.

    Note that this is detached from the Java runtime, and so is a plain python object.
    """

    def __init__(
        self,
        attribute,
        lower_bound,
        upper_bound,
        include_lower_bound,
        include_upper_bound,
    ):
        self.attribute = attribute
        self.lower_bound = lower_bound
        self.upper_bound = upper_bound
        self.include_lower_bound = include_lower_bound
        self.include_upper_bound = include_upper_bound

    def __str__(self):
        out_string = ""

        if isinstance(self.lower_bound, float) and isinstance(self.upper_bound, float):
            if self.lower_bound != float("-inf"):
                if self.include_lower_bound:
                    out_string += "{0:.2f} <= ".format(self.lower_bound)
                else:
                    out_string += "{0:.2f} < ".format(self.lower_bound)

            out_string += self.attribute

            if self.upper_bound != float("inf"):
                if self.include_upper_bound:
                    out_string += " <= {0:.2f}".format(self.upper_bound)
                else:
                    out_string += " < {0:.2f}".format(self.upper_bound)

            return out_string

        else:  ## For datetimes and so on
            if self.lower_bound != float("-inf"):
                if self.include_lower_bound:
                    out_string += "{0} <= ".format(self.lower_bound)
                else:
                    out_string += "{0} < ".format(self.lower_bound)

            out_string += self.attribute

            if self.upper_bound != float("inf"):
                if self.include_upper_bound:
                    out_string += " <= {0}".format(self.upper_bound)
                else:
                    out_string += " < {0}".format(self.upper_bound)

            return out_string


class PyNominalSelector:
    """
    Represents a rule to select a subset of data, which combines with other selectors to form the subgroup/pattern definition.
    It indicates an attribute-value pair through `attribute` and `value`.

    Note that this is detached from the Java runtime, and so is a plain python object.
    """

    def __init__(self, attribute, value):
        self.attribute = attribute
        self.value = value

    def __str__(self):
        return "{0} = {1}".format(self.attribute, self.value)


class PySubgroup:
    """
    Represents a subgroup in terms of its selectors, target evaluation, size and quality.

    Note that this is detached from the Java runtime, and so is a plain python object.

    Attributes
    --------------
    selectors: list
        A list of `PyNumericSelector` and `PyNominalSelector` objects representing the rules constituting the subgroup/pattern.
    target_evaluation: float
        The value of the target variable for this subgroup (when evaluated against the dataset originally used for subgroup discovery).
    size: int
        The number of members in this subgroup (when evaluated against the dataset originally used for subgroup discovery).
    quality: float
        The quality of this subgroup (when applying the quality function to the dataset originally used for subgroup discovery).
    target: string
        The name of the target column.
    target_value: object
        The value of the target variable that counts as the 'positive' class.
    """

    def __init__(
        self, selectors, target_evaluation, size, quality, target, target_value
    ):
        self.selectors = selectors
        self.target_evaluation = target_evaluation
        self.size = size
        self.quality = quality
        self.target = target
        self.target_value = target_value

    def __str__(self):
        return " AND ".join([str(sel) for sel in self.selectors]).strip()

    def get_indices(self, data):
        """
        Get the indices of rows that meet the subgroup definition for a specified dataset.

        Parameters
        ----------------
        data: pandas DataFrame
            The dataset in which to look for (the indices of) rows that match the subgroup definition.

        Returns
        -----------
        index: pandas Index
            The index identifying rows that meet the subgroup definition in the dataset provided.
        """

        # Start with every row selected, then narrow down one selector at a time.
        logical_indices = np.ones(data.index.shape, dtype=bool)

        for sel in self.selectors:
            if isinstance(sel, PyNumericSelector):
                if sel.include_lower_bound and sel.lower_bound != float("-inf"):
                    logical_indices = logical_indices & (
                        data[sel.attribute].values >= sel.lower_bound
                    )
                elif sel.lower_bound != float("-inf"):
                    logical_indices = logical_indices & (
                        data[sel.attribute].values > sel.lower_bound
                    )
                if sel.include_upper_bound and sel.upper_bound != float("inf"):
                    logical_indices = logical_indices & (
                        data[sel.attribute].values <= sel.upper_bound
                    )
                elif sel.upper_bound != float("inf"):
                    logical_indices = logical_indices & (
                        data[sel.attribute].values < sel.upper_bound
                    )

            if isinstance(sel, PyNominalSelector):
                # Nominal values are compared as strings, matching the conversion in PyOntology.
                logical_indices = logical_indices & (
                    data[sel.attribute].astype(str).values == sel.value
                )

        return data.index[logical_indices]

    def get_rows(self, data):
        """
        Get the rows that meet the subgroup definition for a specified dataset.

        Parameters
        ----------------
        data: pandas DataFrame
            The dataset in which to look for rows that match the subgroup definition.

        Returns
        -----------
        rows: pandas DataFrame
            A selection of rows in the provided dataset that meet the subgroup definition.
        """

        return data.loc[self.get_indices(data)]


class PySubgroupResults:
    """
    A collection of subgroups, returned as a result of performing subgroup discovery.

    Note that this is detached from the Java runtime, and so is a plain python object.

    Attributes
    --------------
    subgroups: list
        A list of `PySubgroup` objects.
    population_evaluation: float
        The value of the target variable across the entire dataset originally used for subgroup discovery.
    population_size: int
        The number of rows in the dataset originally used for subgroup discovery.
    target: string
        The name of the target column.
    target_value: object
        The value of the target variable that counts as the 'positive' class.
    """

    def __init__(
        self, subgroups, population_evaluation, population_size, target, target_value
    ):
        self.subgroups = subgroups
        self.population_evaluation = population_evaluation
        self.population_size = population_size
        self.target = target
        self.target_value = target_value

    def __len__(self):
        return len(self.subgroups)

    def __iter__(self):
        return self.subgroups.__iter__()

    def __getitem__(self, selection):
        # A non-empty list of pattern strings selects the matching subgroups.
        if (
            isinstance(selection, list)
            and len(selection) > 0
            and isinstance(selection[0], str)
        ):
            subgroups = [sg for sg in self.subgroups if str(sg) in selection]

            not_present = [
                sel for sel in selection if sel not in [str(sg) for sg in subgroups]
            ]

            if len(not_present) > 0:
                raise ValueError(
                    "Indices {} not found in {}.".format(not_present, self)
                )

            out = copy.copy(self)
            out.subgroups = subgroups

            return out

        # A single pattern string selects one subgroup.
        if isinstance(selection, str):
            for sg in self.subgroups:
                if str(sg) == selection:
                    return sg

            raise ValueError("Index {} not found in {}.".format(selection, self))

        # Any other iterable is treated as a collection of integer positions.
        if hasattr(selection, "__iter__"):
            subgroups = [self.subgroups[i] for i in selection]

            out = copy.copy(self)
            out.subgroups = subgroups

            return out

        if isinstance(selection, slice):
            out = copy.copy(self)
            out.subgroups = self.subgroups.__getitem__(selection)

            return out

        return self.subgroups[selection]

    def to_df(self):
        """
        Convert the subgroups included in this object into an easy-to-read pandas dataframe for viewing.

        Returns
        -----------
        subgroups_df: pandas DataFrame
            A table showing the subgroup definitions and associated important values like size, target value, and quality.
        """

        return pd.DataFrame(
            [
                {
                    "pattern": str(sg),
                    "target_evaluation": sg.target_evaluation,
                    "size": sg.size,
                    "quality": sg.quality,
                }
                for sg in self.subgroups
            ]
        )


def discover_subgroups(
    ontology,
    target,
    target_value=None,
    included_attributes=None,
    # discretise=True,
    nbins=3,
    method="sdmap",
    qf="ps",
    k=20,
    minqual=0,
    minsize=0,
    mintp=0,
    max_selectors=3,
    ignore_defaults=False,
    filter_irrelevant=False,
    postfilter="",
    postfilter_param=0.00,  ## Must be provided for most postfiltering types
):
    """
    Search for interesting subgroups within a dataset.

    Parameters
    ----------------
    ontology: pandas DataFrame or PyOntology object.
        The data to use to perform subgroup discovery. Can be a pandas DataFrame, or a PyOntology object.
    target: string
        The name of the column to be used as the target.
    target_value: object, optional
        The value of the target variable that counts as the 'positive' class. Not needed for a numeric target, in which case the mean of the target variable will be used for subgroup discovery.
    included_attributes: list, optional
        A list of strings containing the names of columns to use. If not specified, all columns of the data will be used.
    nbins: int, optional
        The number of bins to use when discretising numeric variables. Default value is 3.
    method: string, optional
        Used to decide which algorithm to use. Must be one of Beam-Search `beam`, BSD `bsd`, SD-Map `sdmap`, SD-Map enabling internal disjunctions `sdmap-dis`. The default is `sdmap`.
    qf: string, optional
        Used to decide which quality function to use. Must be one of Adjusted Residuals `ares`, Binomial Test `bin`, Chi-Square Test `chi2`, Gain `gain`, Lift `lift`, Piatetsky-Shapiro `ps`, Relative Gain `relgain`, Weighted Relative Accuracy `wracc`, Wilcoxon-Mann-Whitney Rank `wmw`, Area-Under-Curve `auc`. The default is qf = `ps`.
    k: int, optional
        Maximum number (top-k) of patterns to discover, i.e., the best k patterns according to the selected quality function. The default is 20.
    minqual: float, optional
        The minimal quality. Defaults to 0, meaning there is no minimum.
    minsize: int, optional
        The minimum size of a subgroup in order for it to be included in the results. Defaults to 0, meaning there is no minimum.
    mintp: int, optional
        The minimum number of true positives in a subgroup (relevant for binary target concepts only). Defaults to 0, meaning there is no minimum.
    max_selectors: int, optional
        The maximum number of selectors/rules included in a subgroup. The default is 3.
    ignore_defaults: bool, optional
        If set to True, the values in the first row of data will be considered 'default values', and the same values will be ignored when searching for subgroups. Defaults to False.
    filter_irrelevant: bool, optional
        Whether irrelevant patterns are filtered out. Note that this negatively impacts performance. Defaults to False.
    postfilter: string, optional
        Which post-processing filter is applied.
        Can be one of:
            * Minimum Improvement (Global) `min_improve_global`, which checks the patterns against all possible generalisations;
            * Minimum Improvement (Pattern Set) `min_improve_set`, checks the patterns against all their generalisations in the result set,
            * Relevancy Filter `relevancy`, removes patterns that are strictly irrelevant,
            * Significant Improvement (Global) `sig_improve_global`, removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all their possible generalizations,
            * Significant Improvement (Set) `sig_improve_set`, removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all generalizations in the result set,
            * Weighted Covering `weighted_covering`, performs weighted covering on the data in order to select a covering set of subgroups while reducing the overlap on the data.
        By default, no postfilter is set, i.e., postfilter = "".
    postfilter_param: float, optional
        Provides the corresponding parameter value for the filtering chosen in postfilter. Must be provided for most postfiltering types.

    Returns
    -----------
    subgroups: PySubgroupResults
        The discovered subgroups.
    """

    if isinstance(ontology, PyOntology):
        if target_value is not None:
            raise ValueError(
                "target_value cannot be provided when passing in a PyOntology instead of a pandas DataFrame."
            )

    else:
        if target_value is not None:
            # Replace the target column with a boolean column marking the 'positive' class.
            target_bool = ontology[target] == target_value

            ontology = ontology.drop(columns=target).join(target_bool)

        elif ontology[target].dtype == "bool":
            target_value = True

        elif (
            ontology[target].dtype == "object"
            or ontology[target].dtype.name == "category"
        ):
            target_value = ontology[target].iloc[0]

        ontology = PyOntology(
            ontology.reset_index(drop=True)
        )  ## Reset index because pandas seems to confuse itself when there is a MultiIndex!!!

    ont = ontology.ontology

    includedAttributes = java.util.HashSet()
    if included_attributes:
        includedAttributes.addAll(included_attributes)
    else:
        includedAttributes.addAll(ontology.column_names)

    if target not in includedAttributes:
        includedAttributes.add(target)

    subgroups = SD4Py.discoverSubgroups(
        ont,
        JString(target),
        includedAttributes,
        JInt(nbins),
        JString(method),
        JString(qf),
        JInt(k),
        JDouble(minqual),
        JInt(minsize),
        JInt(mintp),
        JInt(max_selectors),
        JBoolean(ignore_defaults),
        JBoolean(filter_irrelevant),
        JString(postfilter),
        JDouble(postfilter_param),
    )

    py_subgroups = []

    population_value = None
    population_size = None

    # Copy each Java subgroup into plain Python objects, best quality first.
    for sg in subgroups.sortSubgroupsByQualityDescending():
        py_selectors = []

        for selector in sg.getDescription():
            attribute_id = str(selector.getAttribute().getId())

            if isinstance(selector, DefaultSGSelector):
                py_selectors.append(
                    PyNominalSelector(
                        attribute_id,
                        str(list(selector.getValues())[0]),
                    )
                )

            if isinstance(selector, NumericSelector):
                lb = selector.getLowerBound()
                ub = selector.getUpperBound()

                if attribute_id in ontology.datetime_columns:
                    try:  ## convert back to datetime
                        lb = (
                            pd.to_datetime(int(lb))
                            .tz_localize("GMT")
                            .tz_convert(ontology.datetime_columns[attribute_id])
                        )
                    except OverflowError:  ## if 'inf'
                        lb = float("-inf")

                    try:
                        ub = (
                            pd.to_datetime(int(ub))
                            .tz_localize("GMT")
                            .tz_convert(ontology.datetime_columns[attribute_id])
                        )
                    except OverflowError:
                        ub = float("inf")

                elif attribute_id in ontology.timedelta_columns:
                    try:  ## convert back to timedelta
                        lb = pd.to_timedelta(int(lb))
                    except OverflowError:  ## if 'inf'
                        lb = float("-inf")

                    try:
                        ub = pd.to_timedelta(int(ub))
                    except OverflowError:
                        ub = float("inf")

                else:
                    lb = float(lb)
                    ub = float(ub)

                py_selectors.append(
                    PyNumericSelector(
                        attribute_id,
                        lb,
                        ub,
                        bool(selector.isIncludeLowerBound()),
                        bool(selector.isIncludeUpperBound()),
                    )
                )

        stats = sg.getStatistics()
        population_value = float(stats.getTargetQuantityPopulation())
        population_size = float(stats.getDefinedPopulationCount())

        subgroup_value = float(stats.getTargetQuantitySG())
        subgroup_size = float(stats.getSubgroupSize())
        subgroup_quality = float(sg.getQuality())

        py_subgroups.append(
            PySubgroup(
                py_selectors,
                subgroup_value,
                subgroup_size,
                subgroup_quality,
                target,
                target_value,
            )
        )

    return PySubgroupResults(
        py_subgroups, population_value, population_size, target, target_value
    )
class PyOntology:
Puts data into a Java Ontology object for use with the underlying Java subgroup discovery application.
It is not necessary to use this class explicitly; pandas dataframes will automatically be converted into a PyOntology object when passed into discover_subgroups().
However, if the dataset is large, and subgroup discovery will be performed multiple times, then the user may opt to convert the dataset into a PyOntology to pass into discover_subgroups() for the sake of performance.
Attributes
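- ontology: created during initialisation and bound to an Ontology object in the Java runtime. The constructor also stores column_names, column_types, datetime_columns and timedelta_columns, which discover_subgroups() uses when translating results back into Python objects.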
class PyNumericSelector:
Represents a rule to select a subset of data, which combines with other selectors to form the subgroup/pattern definition.
The relevant attribute name in the data is stored in attribute.
The selector holds a numeric lower_bound and upper_bound, plus the booleans include_lower_bound and include_upper_bound, which decide whether the boundary values themselves are included in the selection.
Note that this is detached from the Java runtime, and so is a plain Python object.
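The interplay of the two bounds and the two inclusion flags can be sketched without sd4py or a JVM. IntervalRule below is a hypothetical stand-in written for this illustration, not the package's PyNumericSelector. It mirrors the behaviour described above: an infinite bound is treated as "no bound" and is left out of the printed rule.

```python
import math
from dataclasses import dataclass


@dataclass
class IntervalRule:
    """Hypothetical stand-in for a numeric selector (illustration only)."""

    attribute: str
    lower_bound: float = -math.inf
    upper_bound: float = math.inf
    include_lower_bound: bool = True
    include_upper_bound: bool = False

    def matches(self, value):
        """Return True if `value` lies inside the interval."""
        if self.include_lower_bound:
            above = value >= self.lower_bound
        else:
            above = value > self.lower_bound
        if self.include_upper_bound:
            below = value <= self.upper_bound
        else:
            below = value < self.upper_bound
        return above and below

    def __str__(self):
        text = self.attribute
        if self.lower_bound != -math.inf:
            op = "<=" if self.include_lower_bound else "<"
            text = "{0:.2f} {1} {2}".format(self.lower_bound, op, text)
        if self.upper_bound != math.inf:
            op = "<=" if self.include_upper_bound else "<"
            text = "{0} {1} {2:.2f}".format(text, op, self.upper_bound)
        return text


bounded = IntervalRule("Temperature", 20.0, 35.5)
open_below = IntervalRule("Pressure", upper_bound=1.2, include_upper_bound=True)

print(bounded)     # 20.00 <= Temperature < 35.50
print(open_below)  # Pressure <= 1.20
```

With these defaults the interval is closed on the left and open on the right, so 20.0 matches the first rule while 35.5 does not.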
class PyNominalSelector:
Represents a rule to select a subset of data, which combines with other selectors to form the subgroup/pattern definition.
It indicates an attribute-value pair through attribute and value.
Note that this is detached from the Java runtime, and so is a plain Python object.
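One detail visible in the module source is that nominal columns are converted to strings before they reach Java, and rows are later matched by comparing the string form of each cell with the selector's value. The sketch below illustrates that comparison in plain Python. The helper matches_nominal and the sample rows are assumptions made for this example, not part of sd4py.

```python
def matches_nominal(row, attribute, value):
    """Hypothetical helper: compare the cell as text, so True matches "True"."""
    return str(row[attribute]) == value


rows = [
    {"Machine": "A", "Running": True},
    {"Machine": "B", "Running": False},
]

# The selector value is a string even though the column holds booleans.
running = [r for r in rows if matches_nominal(r, "Running", "True")]

# Same "attribute = value" layout that the selector prints.
rule_text = "{0} = {1}".format("Running", "True")

print(rule_text, running)
```

A practical consequence is that a value passed as the boolean True would not equal the stored string "True", so comparisons against selector values should use the string form.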
class PySubgroup:
Represents a subgroup in terms of its selectors, target evaluation, size and quality.
Note that this is detached from the Java runtime, and so is a plain python object.
Attributes
- selectors (list): A list of `PySelector` objects representing the rules constituting the subgroup/pattern.
- target_evaluation (float): The value of the target variable for this subgroup (when evaluated against the dataset originally used for subgroup discovery).
- size (int): The number of members in this subgroup (when evaluated against the dataset originally used for subgroup discovery).
- quality (float): The quality of this subgroup (when applying the quality function to the dataset originally used for subgroup discovery).
- target (string): The name of the target column.
- target_value (object): The value of the target variable that counts as the 'positive' class.
def get_indices(self, data):
Get the indices of rows that meet the subgroup definition for a specified dataset.
Parameters
- data (pandas DataFrame): The dataset in which to look for (the indices of) rows that match the subgroup definition.
Returns
- index (pandas Index): The index identifying rows that meet the subgroup definition in the dataset provided.
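The matching logic described above can be sketched without the Java runtime. The block below is a minimal illustration, not the package's own code: `NumericRule`, `NominalRule` and `matching_index` are hypothetical stand-ins for `PyNumericSelector`, `PyNominalSelector` and `get_indices`, and the data are made up. It shows how each rule narrows a boolean mask, with infinite bounds treated as "no bound" and nominal values compared as strings.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass
class NumericRule:  # hypothetical stand-in for PyNumericSelector
    attribute: str
    lower_bound: float
    upper_bound: float
    include_lower_bound: bool
    include_upper_bound: bool


@dataclass
class NominalRule:  # hypothetical stand-in for PyNominalSelector
    attribute: str
    value: str


def matching_index(data, rules):
    """Return the index labels of the rows that satisfy every rule."""
    mask = np.ones(len(data), dtype=bool)
    for rule in rules:
        column = data[rule.attribute]
        if isinstance(rule, NumericRule):
            values = column.to_numpy()
            # An infinite bound means "unbounded", so no comparison is applied.
            if rule.lower_bound != float("-inf"):
                if rule.include_lower_bound:
                    mask = mask & (values >= rule.lower_bound)
                else:
                    mask = mask & (values > rule.lower_bound)
            if rule.upper_bound != float("inf"):
                if rule.include_upper_bound:
                    mask = mask & (values <= rule.upper_bound)
                else:
                    mask = mask & (values < rule.upper_bound)
        else:
            # Nominal values are compared after conversion to strings.
            mask = mask & (column.astype(str).to_numpy() == rule.value)
    return data.index[mask]


# Made-up data for illustration only.
data = pd.DataFrame(
    {"Temperature": [10, 20, 30, 40], "Pressure": ["low", "high", "low", "low"]},
    index=["a", "b", "c", "d"],
)
rules = [
    NumericRule("Temperature", 20, float("inf"), True, False),
    NominalRule("Pressure", "low"),
]
matched = matching_index(data, rules)  # index labels of matching rows
rows = data.loc[matched]  # the same lookup that get_rows performs
```

Because the result is an index rather than a boolean mask, it can be applied to any dataset with the same columns, for example a held-out test set.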
def get_rows(self, data):
Get the rows that meet the subgroup definition for a specified dataset.
Parameters
- data (pandas DataFrame): The dataset in which to look for rows that match the subgroup definition.
Returns
- rows (pandas DataFrame): A selection of rows in the provided dataset that meet the subgroup definition.
class PySubgroupResults:
    """
    A collection of subgroups, returned as a result of performing subgroup discovery.

    Note that this is detached from the Java runtime, and so is a plain python object.

    Attributes
    --------------
    subgroups: list
        A list of `PySubgroup` objects.
    population_evaluation: float
        The value of the target variable across the entire dataset originally used for subgroup discovery.
    population_size: int
        The number of rows in the dataset originally used for subgroup discovery.
    target: string
        The name of the target column.
    target_value: object
        The value of the target variable that counts as the 'positive' class.
    """

    def __init__(
        self, subgroups, population_evaluation, population_size, target, target_value
    ):
        self.subgroups = subgroups
        self.population_evaluation = population_evaluation
        self.population_size = population_size
        self.target = target
        self.target_value = target_value

    def __len__(self):
        return len(self.subgroups)

    def __iter__(self):
        return self.subgroups.__iter__()

    def __getitem__(self, selection):
        ## A non-empty list of pattern strings selects the matching subgroups
        if (
            isinstance(selection, list)
            and len(selection) > 0
            and isinstance(selection[0], str)
        ):
            subgroups = [sg for sg in self.subgroups if str(sg) in selection]

            not_present = [
                sel for sel in selection if sel not in [str(sg) for sg in subgroups]
            ]

            if len(not_present) > 0:
                raise ValueError(
                    "Indices {} not found in {}.".format(not_present, self)
                )

            out = copy.copy(self)
            out.subgroups = subgroups

            return out

        ## A single pattern string selects one subgroup
        if isinstance(selection, str):
            for sg in self.subgroups:
                if str(sg) == selection:
                    return sg

            raise ValueError("Index {} not found in {}.".format(selection, self))

        ## Any other iterable is treated as a collection of integer positions
        if hasattr(selection, "__iter__"):
            subgroups = [self.subgroups[i] for i in selection]

            out = copy.copy(self)
            out.subgroups = subgroups

            return out

        if isinstance(selection, slice):
            out = copy.copy(self)
            out.subgroups = self.subgroups.__getitem__(selection)

            return out

        return self.subgroups[selection]

    def to_df(self):
        """
        Convert the subgroups included in this object into an easy-to-read pandas dataframe for viewing.

        Returns
        -----------
        subgroups_df: pandas DataFrame
            A table showing the subgroup definitions and associated important values like size, target value, and quality.
        """

        return pd.DataFrame(
            [
                {
                    "pattern": str(sg),
                    "target_evaluation": sg.target_evaluation,
                    "size": sg.size,
                    "quality": sg.quality,
                }
                for sg in self.subgroups
            ]
        )
A collection of subgroups, returned as a result of performing subgroup discovery.
Note that this is detached from the Java runtime, and so is a plain python object.
Attributes
- subgroups (list): A list of `PySubgroup` objects.
- population_evaluation (float): The value of the target variable across the entire dataset originally used for subgroup discovery.
- population_size (int): The number of rows in the dataset originally used for subgroup discovery.
- target (string): The name of the target column.
- target_value (object): The value of the target variable that counts as the 'positive' class.
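The source listing shows that indexing a results object dispatches on the type of the key: an integer or a single pattern string returns one subgroup, while a slice or a list returns a new collection. The sketch below mirrors that dispatch in a simplified form. `Results` is a hypothetical stand-in for `PySubgroupResults`, and plain strings stand in for `PySubgroup` objects.

```python
import copy


class Results:  # hypothetical, simplified stand-in for PySubgroupResults
    def __init__(self, subgroups):
        self.subgroups = subgroups  # plain strings stand in for PySubgroup objects

    def __len__(self):
        return len(self.subgroups)

    def _with(self, subgroups):
        # Shallow-copy the collection so other attributes carry over unchanged.
        out = copy.copy(self)
        out.subgroups = subgroups
        return out

    def __getitem__(self, selection):
        if isinstance(selection, str):
            # A single pattern string returns the subgroup itself.
            for sg in self.subgroups:
                if str(sg) == selection:
                    return sg
            raise ValueError("Index {} not found.".format(selection))
        if isinstance(selection, slice):
            return self._with(self.subgroups[selection])
        if isinstance(selection, list):
            if selection and isinstance(selection[0], str):
                # Pattern strings: keep matches in their original order.
                picked = [sg for sg in self.subgroups if str(sg) in selection]
                missing = [s for s in selection if s not in picked]
                if missing:
                    raise ValueError("Indices {} not found.".format(missing))
            else:
                # Integer positions: keep the order given by the caller.
                picked = [self.subgroups[i] for i in selection]
            return self._with(picked)
        return self.subgroups[selection]  # plain integer position


results = Results(["A", "B", "C"])
first = results[0]  # a single subgroup
by_name = results["B"]  # a single subgroup, looked up by its pattern
top_two = results[:2]  # a new collection
picked = results[["C", "A"]]  # a new collection, in original order
```

Returning a collection for slices and lists means the result still supports `len()`, iteration and further indexing.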
def to_df(self):
Convert the subgroups included in this object into an easy-to-read pandas dataframe for viewing.
Returns
- subgroups_df (pandas DataFrame): A table showing the subgroup definitions and associated important values like size, target value, and quality.
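Once the results are in a DataFrame, ordinary pandas operations apply. The sketch below builds a table with the same four columns that `to_df()` produces and then filters and sorts it. The `SimpleNamespace` records, the `to_table` helper and all numbers are made up for illustration and are not output from the package.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical records standing in for PySubgroup objects; values are invented.
subgroups = [
    SimpleNamespace(
        pattern="'Temperature'=high AND 'Pressure'=low",
        target_evaluation=0.9,
        size=12,
        quality=4.8,
    ),
    SimpleNamespace(
        pattern="'Temperature'=high", target_evaluation=0.7, size=40, quality=8.0
    ),
    SimpleNamespace(
        pattern="'Pressure'=low", target_evaluation=0.6, size=55, quality=5.5
    ),
]


def to_table(subgroups):
    """Build one row per subgroup, using the column names from to_df()."""
    return pd.DataFrame(
        [
            {
                "pattern": sg.pattern,
                "target_evaluation": sg.target_evaluation,
                "size": sg.size,
                "quality": sg.quality,
            }
            for sg in subgroups
        ]
    )


table = to_table(subgroups)

# Keep only subgroups with at least 30 members, best quality first.
large = table[table["size"] >= 30].sort_values("quality", ascending=False)
```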
def discover_subgroups(
    ontology,
    target,
    target_value=None,
    included_attributes=None,
    # discretise=True,
    nbins=3,
    method="sdmap",
    qf="ps",
    k=20,
    minqual=0,
    minsize=0,
    mintp=0,
    max_selectors=3,
    ignore_defaults=False,
    filter_irrelevant=False,
    postfilter="",
    postfilter_param=0.00,  ## Must be provided for most postfiltering types
):
    """
    Search for interesting subgroups within a dataset.

    Parameters
    ----------------
    ontology: pandas DataFrame or PyOntology object
        The data to use to perform subgroup discovery. Can be a pandas DataFrame, or a PyOntology object.
    target: string
        The name of the column to be used as the target.
    target_value: object, optional
        The value of the target variable that counts as the 'positive' class. Not needed for a numeric target, in which case the mean of the target variable will be used for subgroup discovery.
    included_attributes: list, optional
        A list of strings containing the names of columns to use. If not specified, all columns of the data will be used.
    nbins: int, optional
        The number of bins to use when discretising numeric variables. Default value is 3.
    method: string, optional
        Used to decide which algorithm to use. Must be one of Beam-Search `beam`, BSD `bsd`, SD-Map `sdmap`, SD-Map enabling internal disjunctions `sdmap-dis`. The default is `sdmap`.
    qf: string, optional
        Used to decide which quality function to use. Must be one of Adjusted Residuals `ares`, Binomial Test `bin`, Chi-Square Test `chi2`, Gain `gain`, Lift `lift`, Piatetsky-Shapiro `ps`, Relative Gain `relgain`, Weighted Relative Accuracy `wracc`, Wilcoxon-Mann-Whitney Rank `wmw`, Area-Under-Curve `auc`. The default is qf = `ps`.
    k: int, optional
        Maximum number (top-k) of patterns to discover, i.e., the best k patterns according to the selected quality function. The default is 20.
    minqual: float, optional
        The minimal quality. Defaults to 0, meaning there is no minimum.
    minsize: int, optional
        The minimum size of a subgroup in order for it to be included in the results. Defaults to 0, meaning there is no minimum.
    mintp: int, optional
        The minimum number of true positives in a subgroup (relevant for binary target concepts only). Defaults to 0, meaning there is no minimum.
    max_selectors: int, optional
        The maximum number of selectors/rules included in a subgroup. The default is 3.
    ignore_defaults: bool, optional
        If set to True, the values in the first row of data will be considered ‘default values’, and the same values will be ignored when searching for subgroups. Defaults to False.
    filter_irrelevant: bool, optional
        Whether irrelevant patterns are filtered out. Note that this negatively impacts performance. Defaults to False.
    postfilter: string, optional
        Which post-processing filter is applied.
        Can be one of:
            * Minimum Improvement (Global) `min_improve_global`, which checks the patterns against all possible generalisations;
            * Minimum Improvement (Pattern Set) `min_improve_set`, checks the patterns against all their generalisations in the result set,
            * Relevancy Filter `relevancy`, removes patterns that are strictly irrelevant,
            * Significant Improvement (Global) `sig_improve_global`, removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all their possible generalizations,
            * Significant Improvement (Set) `sig_improve_set`, removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all generalizations in the result set,
            * Weighted Covering `weighted_covering`, performs weighted covering on the data in order to select a covering set of subgroups while reducing the overlap on the data.
        By default, no postfilter is set, i.e., postfilter = "".
    postfilter_param: float, optional
        Provides the corresponding parameter value for the filtering chosen in postfilter. Must be provided for most postfiltering types.

    Returns
    -----------
    subgroups: PySubgroupResults
        The discovered subgroups.
    """

    if isinstance(ontology, PyOntology):
        if target_value is not None:
            raise ValueError(
                "target_value cannot be provided when passing in a PyOntology instead of a pandas DataFrame."
            )

    else:
        if target_value is not None:
            target_bool = ontology[target] == target_value

            ontology = ontology.drop(columns=target).join(target_bool)

        elif ontology[target].dtype == "bool":
            target_value = True

        elif (
            ontology[target].dtype == "object"
            or ontology[target].dtype.name == "category"
        ):
            target_value = ontology[target].iloc[0]

        ontology = PyOntology(
            ontology.reset_index(drop=True)
        )  ## Reset index because pandas seems to confuse itself when there is a MultiIndex!!!

    ont = ontology.ontology

    includedAttributes = java.util.HashSet()
    if included_attributes:
        includedAttributes.addAll(included_attributes)
    else:
        includedAttributes.addAll(ontology.column_names)

    if target not in includedAttributes:
        includedAttributes.add(target)

    subgroups = SD4Py.discoverSubgroups(
        ont,
        JString(target),
        includedAttributes,
        JInt(nbins),
        JString(method),
        JString(qf),
        JInt(k),
        JDouble(minqual),
        JInt(minsize),
        JInt(mintp),
        JInt(max_selectors),
        JBoolean(ignore_defaults),
        JBoolean(filter_irrelevant),
        JString(postfilter),
        JDouble(postfilter_param),
    )

    py_subgroups = []

    population_value = None
    population_size = None

    for sg in subgroups.sortSubgroupsByQualityDescending():
        py_selectors = []

        for selector in sg.getDescription():
            if isinstance(selector, DefaultSGSelector):
                py_selectors.append(
                    PyNominalSelector(
                        str(selector.getAttribute().getId()),
                        str(list(selector.getValues())[0]),
                    )
                )

            if isinstance(selector, NumericSelector):
                lb = selector.getLowerBound()
                ub = selector.getUpperBound()

                if str(selector.getAttribute().getId()) in ontology.datetime_columns:
                    try:  ## convert back to datetime
                        lb = (
                            pd.to_datetime(int(lb))
                            .tz_localize("GMT")
                            .tz_convert(
                                ontology.datetime_columns[
                                    str(selector.getAttribute().getId())
                                ]
                            )
                        )
                    except OverflowError:  ## if 'inf'
                        lb = float("-inf")

                    try:
                        ub = (
                            pd.to_datetime(int(ub))
                            .tz_localize("GMT")
                            .tz_convert(
                                ontology.datetime_columns[
                                    str(selector.getAttribute().getId())
                                ]
                            )
                        )
                    except OverflowError:
                        ub = float("inf")

                elif str(selector.getAttribute().getId()) in ontology.timedelta_columns:
                    try:  ## convert back to timedelta
                        lb = pd.to_timedelta(int(lb))
                    except OverflowError:  ## if 'inf'
                        lb = float("-inf")

                    try:
                        ub = pd.to_timedelta(int(ub))
                    except OverflowError:
                        ub = float("inf")

                else:
                    lb = float(lb)
                    ub = float(ub)

                py_selectors.append(
                    PyNumericSelector(
                        str(selector.getAttribute().getId()),
                        lb,
                        ub,
                        bool(selector.isIncludeLowerBound()),
                        bool(selector.isIncludeUpperBound()),
                    )
                )

        stats = sg.getStatistics()
        population_value = float(stats.getTargetQuantityPopulation())
        population_size = float(stats.getDefinedPopulationCount())

        subgroup_value = float(stats.getTargetQuantitySG())
        subgroup_size = float(stats.getSubgroupSize())
        subgroup_quality = float(sg.getQuality())

        py_subgroups.append(
            PySubgroup(
                py_selectors,
                subgroup_value,
                subgroup_size,
                subgroup_quality,
                target,
                target_value,
            )
        )

    return PySubgroupResults(
        py_subgroups, population_value, population_size, target, target_value
    )
Search for interesting subgroups within a dataset.
Parameters
- ontology (pandas DataFrame or PyOntology object): The data to use to perform subgroup discovery. Can be a pandas DataFrame or a PyOntology object.
- target (string): The name of the column to be used as the target.
- target_value (object, optional): The value of the target variable that counts as the 'positive' class. Not needed for a numeric target, in which case the mean of the target variable will be used for subgroup discovery.
- included_attributes (list, optional): A list of strings containing the names of columns to use. If not specified, all columns of the data will be used.
- nbins (int, optional): The number of bins to use when discretising numeric variables. Default value is 3.
- method (string, optional): Used to decide which algorithm to use. Must be one of Beam-Search `beam`, BSD `bsd`, SD-Map `sdmap`, SD-Map enabling internal disjunctions `sdmap-dis`. The default is `sdmap`.
- qf (string, optional): Used to decide which quality function to use. Must be one of Adjusted Residuals `ares`, Binomial Test `bin`, Chi-Square Test `chi2`, Gain `gain`, Lift `lift`, Piatetsky-Shapiro `ps`, Relative Gain `relgain`, Weighted Relative Accuracy `wracc`, Wilcoxon-Mann-Whitney Rank `wmw`, Area-Under-Curve `auc`. The default is `ps`.
- k (int, optional): Maximum number (top-k) of patterns to discover, i.e., the best k patterns according to the selected quality function. The default is 20.
- minqual (float, optional): The minimal quality. Defaults to 0, meaning there is no minimum.
- minsize (int, optional): The minimum size of a subgroup in order for it to be included in the results. Defaults to 0, meaning there is no minimum.
- mintp (int, optional): The minimum number of true positives in a subgroup (relevant for binary target concepts only). Defaults to 0, meaning there is no minimum.
- max_selectors (int, optional): The maximum number of selectors/rules included in a subgroup. The default is 3.
- ignore_defaults (bool, optional): If set to True, the values in the first row of data will be considered ‘default values’, and the same values will be ignored when searching for subgroups. Defaults to False.
- filter_irrelevant (bool, optional): Whether irrelevant patterns are filtered out. Note that this negatively impacts performance. Defaults to False.
- postfilter (string, optional): Which post-processing filter is applied. Can be one of:
  - Minimum Improvement (Global) `min_improve_global`, which checks the patterns against all possible generalisations;
  - Minimum Improvement (Pattern Set) `min_improve_set`, which checks the patterns against all their generalisations in the result set;
  - Relevancy Filter `relevancy`, which removes patterns that are strictly irrelevant;
  - Significant Improvement (Global) `sig_improve_global`, which removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all their possible generalisations;
  - Significant Improvement (Set) `sig_improve_set`, which removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all generalisations in the result set;
  - Weighted Covering `weighted_covering`, which performs weighted covering on the data in order to select a covering set of subgroups while reducing the overlap on the data.

  By default, no postfilter is set, i.e., postfilter = "".
- postfilter_param (float, optional): Provides the corresponding parameter value for the filtering chosen in postfilter. Must be provided for most postfiltering types.
Returns
- subgroups (PySubgroupResults): The discovered subgroups.
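When a DataFrame is passed in, the source listing shows that the target column is prepared before the data reaches the Java side. The sketch below reproduces only that preparation step in plain pandas so it can be run without a JVM. `resolve_target` is a hypothetical helper name and the example data are made up. It is an illustration of the branching, not a replacement for `discover_subgroups()`.

```python
import pandas as pd


def resolve_target(data, target, target_value=None):
    """Mirror the DataFrame branch: return (prepared_data, target_value)."""
    if target_value is not None:
        # An explicit positive class turns the target into a boolean column,
        # which is re-attached as the last column.
        target_bool = data[target] == target_value
        data = data.drop(columns=target).join(target_bool)
    elif data[target].dtype == "bool":
        target_value = True
    elif data[target].dtype == "object" or data[target].dtype.name == "category":
        # Without an explicit value, the first row's value becomes the positive class.
        target_value = data[target].iloc[0]
    # A numeric target falls through with target_value left as None.
    return data.reset_index(drop=True), target_value


# Made-up data for illustration only.
data = pd.DataFrame({"colour": ["red", "blue", "red"], "x": [1, 2, 3]})
prepared, positive = resolve_target(data, "colour", "red")

flags = pd.DataFrame({"flag": [True, False], "x": [1, 2]})
_, flag_positive = resolve_target(flags, "flag")

numeric = pd.DataFrame({"y": [1.0, 2.0]})
_, numeric_positive = resolve_target(numeric, "y")
```

One consequence worth noting: for a non-numeric target with no `target_value`, the positive class depends on row order, so passing `target_value` explicitly gives more predictable results.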