sd4py.sd4py

sd4py is a package that makes it easy to perform subgroup discovery on tabular data. It is extremely simple to use. Call the sd4py.discover_subgroups() function on a pandas dataframe and a collection of subgroups will be returned.

This package provides a Python interface for using the Java application VIKAMINE.

Subgroup discovery is based on finding patterns within some (explanatory) columns of data that then help to explain another (target) column of data. The goal of the subgroup discovery process will be to understand in what circumstances the target is extreme. With a numeric target, this means finding circumstances in which the value is exceptionally high (or exceptionally low) on average. For a non-numeric target, this means looking for circumstances when a particular value is especially likely to occur. One of the key benefits of this approach is that the outputs are interpretable, being expressed as a readable combination of rules like (e.g.) "'Temperature'=high AND 'Pressure'=low".
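
To make the idea concrete, the following sketch evaluates one candidate rule by hand on a toy dataset. It uses only the Python standard library and does not call sd4py. The column names, the data, and the mean-shift score `size * (subgroup mean - population mean)` are assumptions chosen for illustration, and are not necessarily the quality function that VIKAMINE applies.

```python
from statistics import mean

# Toy data (hypothetical): two explanatory columns and a numeric target.
rows = [
    {"Temperature": "high", "Pressure": "low", "Defects": 9.0},
    {"Temperature": "high", "Pressure": "low", "Defects": 7.0},
    {"Temperature": "high", "Pressure": "high", "Defects": 3.0},
    {"Temperature": "low", "Pressure": "low", "Defects": 2.0},
    {"Temperature": "low", "Pressure": "high", "Defects": 1.0},
    {"Temperature": "low", "Pressure": "high", "Defects": 2.0},
]


def evaluate_rule(rows, rule, target):
    """Compare the target mean inside a subgroup with the population mean.

    `rule` maps column names to required values; a row belongs to the
    subgroup only if it matches every pair (a conjunction of selectors).
    """
    members = [r for r in rows if all(r[col] == val for col, val in rule.items())]
    population_mean = mean(r[target] for r in rows)
    subgroup_mean = mean(r[target] for r in members)
    # Illustrative mean-shift score: larger subgroups with a bigger
    # deviation from the population mean score higher.
    score = len(members) * (subgroup_mean - population_mean)
    return len(members), subgroup_mean, population_mean, score


size, sg_mean, pop_mean, score = evaluate_rule(
    rows, {"Temperature": "high", "Pressure": "low"}, "Defects"
)
print(size, sg_mean, pop_mean, score)
```

Here the rule "'Temperature'=high AND 'Pressure'=low" covers two rows whose average target value is well above the population average, which is the kind of pattern subgroup discovery searches for automatically.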

The package contains a discover_subgroups() function that finds subgroups based on a pandas DataFrame and a specified target column. The package also includes custom python objects for holding the results.

"""
sd4py is a package that makes it easy to perform subgroup discovery on tabular data. It is extremely simple to use. Call the `sd4py.discover_subgroups()` function on a pandas dataframe and a collection of subgroups will be returned.

This package provides a Python interface for using the Java application VIKAMINE.

Subgroup discovery is based on finding patterns within some (explanatory) columns of data that then help to explain another (target) column of data.
The goal of the subgroup discovery process will be to understand in what circumstances the target is extreme. With a numeric target, this means finding circumstances in which the value is exceptionally high (or exceptionally low) on average.
For a non-numeric target, this means looking for circumstances when a particular value is especially likely to occur.
One of the key benefits of this approach is that the outputs are interpretable, being expressed as a readable combination of rules like (e.g.) "'Temperature'=high AND 'Pressure'=low".

The package contains a `discover_subgroups()` function that finds subgroups based on a pandas `DataFrame` and a specified target column. The package also includes custom python objects for holding the results.
"""
import pkg_resources

import jpype
import jpype.imports
from jpype.types import *

## resource_filename is used so that the bundled JARs are found in a distributed package
jpype.startJVM(
    classpath=[
        pkg_resources.resource_filename("sd4py", "vikamine_kernel.jar"),
        pkg_resources.resource_filename("sd4py", "sd4py.jar"),
    ]
)
import java.util.HashSet

from org.vikamine.kernel.subgroup.selectors import *
from org.sd4py.kernel import *

import pandas as pd
import numpy as np

import copy


class PyOntology:
    """
    Puts data into a Java `Ontology` object for use with the underlying Java subgroup discovery application.

    It is not necessary to use this class explicitly; pandas dataframes will automatically be converted into a `PyOntology` object when passed into `discover_subgroups()`.
    However, if the dataset is large, and subgroup discovery will be performed multiple times, then the user may opt to convert the dataset into a `PyOntology` to pass into `discover_subgroups()` for the sake of performance.

    Attributes
    --------------
    ontology
        Created during initialisation and bound to an `Ontology` object in the Java runtime.
    column_names, column_types
        The column names of the dataframe and, for each column, either "nominal" or "numeric".
    datetime_columns, timedelta_columns
        The datetime columns (mapped to their timezone) and the timedelta columns, recorded so that numeric bounds can be converted back when results are returned.
    """

    def __init__(self, df):
        # self.df = df.copy(deep=False)
        # self.index = df.index.copy(deep=False)
        self.column_names = df.columns.to_list()

        self.column_types = []
        self.datetime_columns = {}
        self.timedelta_columns = []

        numeric_arrays = []
        nominal_arrays = []

        for name, x in df.items():
            if (
                x.dtype == "object" or x.dtype == "bool" or x.dtype.name == "category"
            ):  # category depends on whether it's ordered?
                nominal_arrays.append(JArray(JString)(x.astype(str).values))
                self.column_types.append("nominal")

            elif np.issubdtype(x.dtype, np.datetime64):
                numeric_arrays.append(JArray(JDouble)(pd.to_numeric(x).values))
                self.column_types.append("numeric")
                self.datetime_columns[name] = x.dt.tz

            elif np.issubdtype(x.dtype, np.timedelta64):
                numeric_arrays.append(JArray(JDouble)(pd.to_numeric(x).values))
                self.column_types.append("numeric")
                self.timedelta_columns.append(name)

            elif np.issubdtype(x.dtype, np.number):
                if np.issubdtype(
                    x.dtype, "float16"
                ):  # there is no float16 interface between numpy and jpype, so it would raise an error to use .values
                    numeric_arrays.append(JArray(JDouble)(pd.to_numeric(x)))
                    self.column_types.append("numeric")

                else:
                    numeric_arrays.append(JArray(JDouble)(pd.to_numeric(x).values))
                    self.column_types.append("numeric")

            else:
                # Report the column name and dtype rather than printing the whole column
                raise ValueError(
                    "Unrecognised pandas dtype for column {0}: {1}".format(
                        name, x.dtype
                    )
                )

        numeric_arrays = JArray(JArray(JDouble))(numeric_arrays)
        nominal_arrays = JArray(JArray(JString))(nominal_arrays)

        self.ontology = SD4PyOntologyCreator(
            JArray(JString)(self.column_names),
            JArray(JString)(self.column_types),
            numeric_arrays,
            nominal_arrays,
        ).ontology


class PyNumericSelector:
    """
    Represents a rule to select a subset of data, which combines with other selectors to form the subgroup/pattern definition.
    The relevant attribute name in the data is stored in `attribute`.
    This contains a numeric `lower_bound` and `upper_bound`, plus booleans `include_lower_bound` and `include_upper_bound` to decide whether border values are included in the selection.

    Note that this is detached from the Java runtime, and so is a plain python object.
    """

    def __init__(
        self,
        attribute,
        lower_bound,
        upper_bound,
        include_lower_bound,
        include_upper_bound,
    ):
        self.attribute = attribute
        self.lower_bound = lower_bound
        self.upper_bound = upper_bound
        self.include_lower_bound = include_lower_bound
        self.include_upper_bound = include_upper_bound

    def __str__(self):
        out_string = ""

        if isinstance(self.lower_bound, float) and isinstance(self.upper_bound, float):
            if self.lower_bound != float("-inf"):
                if self.include_lower_bound:
                    out_string += "{0:.2f} <= ".format(self.lower_bound)
                else:
                    out_string += "{0:.2f} < ".format(self.lower_bound)

            out_string += self.attribute

            if self.upper_bound != float("inf"):
                if self.include_upper_bound:
                    out_string += " <= {0:.2f}".format(self.upper_bound)
                else:
                    out_string += " < {0:.2f}".format(self.upper_bound)

            return out_string

        else:  ## For datetimes and so on
            if self.lower_bound != float("-inf"):
                if self.include_lower_bound:
                    out_string += "{0} <= ".format(self.lower_bound)
                else:
                    out_string += "{0} < ".format(self.lower_bound)

            out_string += self.attribute

            if self.upper_bound != float("inf"):
                if self.include_upper_bound:
                    out_string += " <= {0}".format(self.upper_bound)
                else:
                    out_string += " < {0}".format(self.upper_bound)

            return out_string


class PyNominalSelector:
    """
    Represents a rule to select a subset of data, which combines with other selectors to form the subgroup/pattern definition.
    It indicates an attribute-value pair through `attribute` and `value`.

    Note that this is detached from the Java runtime, and so is a plain python object.
    """

    def __init__(self, attribute, value):
        self.attribute = attribute
        self.value = value

    def __str__(self):
        return "{0} = {1}".format(self.attribute, self.value)


class PySubgroup:
    """
    Represents a subgroup in terms of its selectors, target evaluation, size and quality.

    Note that this is detached from the Java runtime, and so is a plain python object.

    Attributes
    --------------
    selectors: list
        A list of `PyNumericSelector` and `PyNominalSelector` objects representing the rules constituting the subgroup/pattern.
    target_evaluation: float
        The value of the target variable for this subgroup (when evaluated against the dataset originally used for subgroup discovery).
    size: int
        The number of members in this subgroup (when evaluated against the dataset originally used for subgroup discovery).
    quality: float
        The quality of this subgroup (when applying the quality function to the dataset originally used for subgroup discovery).
    target: string
        The name of the target column.
    target_value: object
        The value of the target variable that counts as the 'positive' class.
    """

    def __init__(
        self, selectors, target_evaluation, size, quality, target, target_value
    ):
        self.selectors = selectors
        self.target_evaluation = target_evaluation
        self.size = size
        self.quality = quality
        self.target = target
        self.target_value = target_value

    def __str__(self):
        return " AND ".join([str(sel) for sel in self.selectors]).strip()

    def get_indices(self, data):
        """
        Get the indices of rows that meet the subgroup definition for a specified dataset.

        Parameters
        ----------------
        data: pandas DataFrame
            The dataset in which to look for (the indices of) rows that match the subgroup definition.

        Returns
        -----------
        index: pandas Index
            The index identifying rows that meet the subgroup definition in the dataset provided.
        """

        # Start with every row selected, then narrow down one selector at a time
        logical_indices = np.ones(data.index.shape, dtype=bool)

        for sel in self.selectors:
            if isinstance(sel, PyNumericSelector):
                if sel.include_lower_bound and sel.lower_bound != float("-inf"):
                    logical_indices = logical_indices & (
                        data[sel.attribute].values >= sel.lower_bound
                    )
                elif sel.lower_bound != float("-inf"):
                    logical_indices = logical_indices & (
                        data[sel.attribute].values > sel.lower_bound
                    )
                if sel.include_upper_bound and sel.upper_bound != float("inf"):
                    logical_indices = logical_indices & (
                        data[sel.attribute].values <= sel.upper_bound
                    )
                elif sel.upper_bound != float("inf"):
                    logical_indices = logical_indices & (
                        data[sel.attribute].values < sel.upper_bound
                    )

            if isinstance(sel, PyNominalSelector):
                logical_indices = logical_indices & (
                    data[sel.attribute].astype(str).values == sel.value
                )

        return data.index[logical_indices]

    def get_rows(self, data):
        """
        Get the rows that meet the subgroup definition for a specified dataset.

        Parameters
        ----------------
        data: pandas DataFrame
            The dataset in which to look for rows that match the subgroup definition.

        Returns
        -----------
        rows: pandas DataFrame
            A selection of rows in the provided dataset that meet the subgroup definition.
        """

        return data.loc[self.get_indices(data)]


class PySubgroupResults:
    """
    A collection of subgroups, returned as a result of performing subgroup discovery.

    Note that this is detached from the Java runtime, and so is a plain python object.

    Attributes
    --------------
    subgroups: list
        A list of `PySubgroup` objects.
    population_evaluation: float
        The value of the target variable across the entire dataset originally used for subgroup discovery.
    population_size: int
        The number of rows in the dataset originally used for subgroup discovery.
    target: string
        The name of the target column.
    target_value: object
        The value of the target variable that counts as the 'positive' class.
    """

    def __init__(
        self, subgroups, population_evaluation, population_size, target, target_value
    ):
        self.subgroups = subgroups
        self.population_evaluation = population_evaluation
        self.population_size = population_size
        self.target = target
        self.target_value = target_value

    def __len__(self):
        return len(self.subgroups)

    def __iter__(self):
        return self.subgroups.__iter__()

    def __getitem__(self, selection):
        # A non-empty list of pattern strings selects the matching subgroups
        if isinstance(selection, list) and selection and isinstance(selection[0], str):
            subgroups = [sg for sg in self.subgroups if str(sg) in selection]

            not_present = [
                sel for sel in selection if sel not in [str(sg) for sg in subgroups]
            ]

            if len(not_present) > 0:
                raise ValueError(
                    "Indices {} not found in {}.".format(not_present, self)
                )

            out = copy.copy(self)
            out.subgroups = subgroups

            return out

        # A single pattern string returns the matching subgroup itself
        if isinstance(selection, str):
            for sg in self.subgroups:
                if str(sg) == selection:
                    return sg

            raise ValueError("Index {} not found in {}.".format(selection, self))

        # Any other iterable is treated as a collection of integer positions
        if hasattr(selection, "__iter__"):
            subgroups = [self.subgroups[i] for i in selection]

            out = copy.copy(self)
            out.subgroups = subgroups

            return out

        if isinstance(selection, slice):
            out = copy.copy(self)
            out.subgroups = self.subgroups.__getitem__(selection)

            return out

        return self.subgroups[selection]

    def to_df(self):
        """
        Convert the subgroups included in this object into an easy-to-read pandas dataframe for viewing.

        Returns
        -----------
        subgroups_df: pandas DataFrame
            A table showing the subgroup definitions and associated important values like size, target value, and quality.
        """

        return pd.DataFrame(
            [
                {
                    "pattern": str(sg),
                    "target_evaluation": sg.target_evaluation,
                    "size": sg.size,
                    "quality": sg.quality,
                }
                for sg in self.subgroups
            ]
        )


def discover_subgroups(
    ontology,
    target,
    target_value=None,
    included_attributes=None,
    #    discretise=True,
    nbins=3,
    method="sdmap",
    qf="ps",
    k=20,
    minqual=0,
    minsize=0,
    mintp=0,
    max_selectors=3,
    ignore_defaults=False,
    filter_irrelevant=False,
    postfilter="",
    postfilter_param=0.00,  ## Must be provided for most postfiltering types
):
    """
    Search for interesting subgroups within a dataset.

    Parameters
    ----------------
    ontology: pandas DataFrame or PyOntology object.
        The data to use to perform subgroup discovery. Can be a pandas DataFrame, or a PyOntology object.
    target: string
        The name of the column to be used as the target.
    target_value: object, optional
        The value of the target variable that counts as the 'positive' class. Not needed for a numeric target, in which case the mean of the target variable will be used for subgroup discovery.
    included_attributes: list, optional
        A list of strings containing the names of columns to use. If not specified, all columns of the data will be used.
    nbins: int, optional
        The number of bins to use when discretising numeric variables. Default value is 3.
    method: string, optional
        Used to decide which algorithm to use. Must be one of Beam-Search `beam`, BSD `bsd`, SD-Map `sdmap`, SD-Map enabling internal disjunctions `sdmap-dis`. The default is `sdmap`.
    qf: string, optional
        Used to decide which quality function to use. Must be one of Adjusted Residuals `ares`, Binomial Test `bin`, Chi-Square Test `chi2`, Gain `gain`, Lift `lift`, Piatetsky-Shapiro `ps`, Relative Gain `relgain`, Weighted Relative Accuracy `wracc`, Wilcoxon-Mann-Whitney Rank `wmw`, Area-Under-Curve `auc`. The default is qf = `ps`.
    k: int, optional
        Maximum number (top-k) of patterns to discover, i.e., the best k patterns according to the selected quality function. The default is 20.
    minqual: float, optional
        The minimal quality. Defaults to 0, meaning there is no minimum.
    minsize: int, optional
        The minimum size of a subgroup in order for it to be included in the results. Defaults to 0, meaning there is no minimum.
    mintp: int, optional
        The minimum number of true positives in a subgroup (relevant for binary target concepts only). Defaults to 0, meaning there is no minimum.
    max_selectors: int, optional
        The maximum number of selectors/rules included in a subgroup. The default is 3.
    ignore_defaults: bool, optional
        If set to True, the values in the first row of data will be considered ‘default values’, and the same values will be ignored when searching for subgroups. Defaults to False.
    filter_irrelevant: bool, optional
        Whether irrelevant patterns are filtered out. Note that this negatively impacts performance. Defaults to False.
    postfilter: string, optional
        Which post-processing filter is applied.
        Can be one of:
         * Minimum Improvement (Global) `min_improve_global`, checks the patterns against all possible generalisations,
         * Minimum Improvement (Pattern Set) `min_improve_set`, checks the patterns against all their generalisations in the result set,
         * Relevancy Filter `relevancy`, removes patterns that are strictly irrelevant,
         * Significant Improvement (Global) `sig_improve_global`, removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all their possible generalisations,
         * Significant Improvement (Set) `sig_improve_set`, removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all generalisations in the result set,
         * Weighted Covering `weighted_covering`, performs weighted covering on the data in order to select a covering set of subgroups while reducing the overlap on the data.
        By default, no postfilter is set, i.e., postfilter = "".
    postfilter_param: float, optional
        Provides the corresponding parameter value for the filtering chosen in postfilter. Must be provided for most postfiltering types.

    Returns
    -----------
    subgroups: PySubgroupResults
        The discovered subgroups.
    """

    if isinstance(ontology, PyOntology):
        if target_value is not None:
            raise ValueError(
                "target_value cannot be provided when passing in a PyOntology instead of a pandas DataFrame."
            )

    else:
        if target_value is not None:
            # Replace the target column with a boolean 'is the positive class' column
            target_bool = ontology[target] == target_value

            ontology = ontology.drop(columns=target).join(target_bool)

        elif ontology[target].dtype == "bool":
            target_value = True

        elif (
            ontology[target].dtype == "object"
            or ontology[target].dtype.name == "category"
        ):
            target_value = ontology[target].iloc[0]

        ontology = PyOntology(
            ontology.reset_index(drop=True)
        )  ## Reset index because pandas seems to confuse itself when there is a MultiIndex!!!

    ont = ontology.ontology

    includedAttributes = java.util.HashSet()
    if included_attributes:
        includedAttributes.addAll(included_attributes)
    else:
        includedAttributes.addAll(ontology.column_names)

    if target not in includedAttributes:
        includedAttributes.add(target)

    subgroups = SD4Py.discoverSubgroups(
        ont,
        JString(target),
        includedAttributes,
        JInt(nbins),
        JString(method),
        JString(qf),
        JInt(k),
        JDouble(minqual),
        JInt(minsize),
        JInt(mintp),
        JInt(max_selectors),
        JBoolean(ignore_defaults),
        JBoolean(filter_irrelevant),
        JString(postfilter),
        JDouble(postfilter_param),
    )

    py_subgroups = []

    population_value = None
    population_size = None

    # Copy each Java subgroup into plain python objects
    for sg in subgroups.sortSubgroupsByQualityDescending():
        py_selectors = []

        for selector in sg.getDescription():
            if isinstance(selector, DefaultSGSelector):
                py_selectors.append(
                    PyNominalSelector(
                        str(selector.getAttribute().getId()),
                        str(list(selector.getValues())[0]),
                    )
                )

            if isinstance(selector, NumericSelector):
                lb = selector.getLowerBound()
                ub = selector.getUpperBound()

                if str(selector.getAttribute().getId()) in ontology.datetime_columns:
                    try:  ## convert back to datetime
                        lb = (
                            pd.to_datetime(int(lb))
                            .tz_localize("GMT")
                            .tz_convert(
                                ontology.datetime_columns[
                                    str(selector.getAttribute().getId())
                                ]
                            )
                        )
                    except OverflowError:  ## if 'inf'
                        lb = float("-inf")

                    try:
                        ub = (
                            pd.to_datetime(int(ub))
                            .tz_localize("GMT")
                            .tz_convert(
                                ontology.datetime_columns[
                                    str(selector.getAttribute().getId())
                                ]
                            )
                        )
                    except OverflowError:
                        ub = float("inf")

                elif str(selector.getAttribute().getId()) in ontology.timedelta_columns:
                    try:  ## convert back to timedelta
                        lb = pd.to_timedelta(int(lb))
                    except OverflowError:  ## if 'inf'
                        lb = float("-inf")

                    try:
                        ub = pd.to_timedelta(int(ub))
                    except OverflowError:
                        ub = float("inf")

                else:
                    lb = float(lb)
                    ub = float(ub)

                py_selectors.append(
                    PyNumericSelector(
                        str(selector.getAttribute().getId()),
                        lb,
                        ub,
                        bool(selector.isIncludeLowerBound()),
                        bool(selector.isIncludeUpperBound()),
                    )
                )

        stats = sg.getStatistics()
        population_value = float(stats.getTargetQuantityPopulation())
        population_size = float(stats.getDefinedPopulationCount())

        subgroup_value = float(stats.getTargetQuantitySG())
        subgroup_size = float(stats.getSubgroupSize())
        subgroup_quality = float(sg.getQuality())

        py_subgroups.append(
            PySubgroup(
                py_selectors,
                subgroup_value,
                subgroup_size,
                subgroup_quality,
                target,
                target_value,
            )
        )

    return PySubgroupResults(
        py_subgroups, population_value, population_size, target, target_value
    )
class PyOntology:

Puts data into a Java Ontology object for use with the underlying Java subgroup discovery application.

It is not necessary to use this class explicitly; pandas dataframes will automatically be converted into a PyOntology object when passed into discover_subgroups(). However, if the dataset is large, and subgroup discovery will be performed multiple times, then the user may opt to convert the dataset into a PyOntology to pass into discover_subgroups() for the sake of performance.

Attributes
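  • ontology is created during initialisation and is bound to an Ontology object in the Java runtime.
  • column_names, column_types, datetime_columns and timedelta_columns record the column layout of the original dataframe, so that results can be converted back into Python values.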
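
The conversion step sorts each column into a nominal or a numeric array before handing it to Java. The sketch below mirrors that decision using pandas type checks only, so it runs without a JVM. The helper name `classify_columns` and the sample data are hypothetical, and the sketch uses `pandas.api.types` predicates rather than the dtype comparisons in the class itself, so treat it as an approximation of the behaviour.

```python
import pandas as pd
from pandas.api import types as ptypes


def classify_columns(df):
    """Return {column name: "nominal" | "numeric"} (hypothetical helper)."""
    kinds = {}
    for name, col in df.items():
        if ptypes.is_bool_dtype(col):
            # Booleans are treated as labels, not as 0/1 numbers.
            kinds[name] = "nominal"
        elif ptypes.is_datetime64_any_dtype(col) or ptypes.is_timedelta64_dtype(col):
            # Dates and durations are passed on as plain numbers.
            kinds[name] = "numeric"
        elif ptypes.is_numeric_dtype(col):
            kinds[name] = "numeric"
        else:
            # Strings and categoricals become nominal values.
            kinds[name] = "nominal"
    return kinds


df = pd.DataFrame(
    {
        "colour": ["red", "blue", "red"],
        "ok": [True, False, True],
        "temperature": [20.5, 21.0, 19.5],
        "measured_at": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
        "waited": pd.to_timedelta([1, 2, 3], unit="h"),
    }
)
kinds = classify_columns(df)
print(kinds)
```

Unlike this sketch, the class raises a ValueError when it meets a dtype it does not recognise, instead of falling back to nominal.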
PyOntology(df)
column_names
column_types
datetime_columns
timedelta_columns
ontology
class PyNumericSelector:
 98class PyNumericSelector:
 99    """
100    Represents a rule to select a subset of data, which combines with other selectors to form the subgroup/pattern definition.
101    The relevant attribute name in the data is stored in `attribute`.
102    This contains a numeric `lower_bound` and `upper_bound`, plus booleans `include_lower_bound` and `include_upper_bound` to decide whether border values are included in the selection.
103
104    Note that this is detached from the Java runtime, and so is a plain python object.
105    """
106
107    def __init__(
108        self,
109        attribute,
110        lower_bound,
111        upper_bound,
112        include_lower_bound,
113        include_upper_bound,
114    ):
115        self.attribute = attribute
116        self.lower_bound = lower_bound
117        self.upper_bound = upper_bound
118        self.include_lower_bound = include_lower_bound
119        self.include_upper_bound = include_upper_bound
120
121    def __str__(self):
122        out_string = ""
123
124        if isinstance(self.lower_bound, float) and isinstance(self.upper_bound, float):
125            if self.lower_bound != float("-inf"):
126                if self.include_lower_bound:
127                    out_string += "{0:.2f} <= ".format(self.lower_bound)
128                else:
129                    out_string += "{0:.2f} < ".format(self.lower_bound)
130
131            out_string += self.attribute
132
133            if self.upper_bound != float("inf"):
134                if self.include_upper_bound:
135                    out_string += " <= {0:.2f}".format(self.upper_bound)
136                else:
137                    out_string += " < {0:.2f}".format(self.upper_bound)
138
139            return out_string
140
141        else:  ## For datetimes and so on
142            if self.lower_bound != float("-inf"):
143                if self.include_lower_bound:
144                    out_string += "{0} <= ".format(self.lower_bound)
145                else:
146                    out_string += "{0} < ".format(self.lower_bound)
147
148            out_string += self.attribute
149
150            if self.upper_bound != float("inf"):
151                if self.include_upper_bound:
152                    out_string += " <= {0}".format(self.upper_bound)
153                else:
154                    out_string += " < {0}".format(self.upper_bound)
155
156            return out_string

Represents a rule to select a subset of data, which combines with other selectors to form the subgroup/pattern definition. The relevant attribute name in the data is stored in attribute. This contains a numeric lower_bound, upper_bound, plus booleans include_lower_bound and include_upper_bound to decide whether border values are included in the selection.

Note that this is detached from the Java runtime, and so is a plain python object.
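
The string form of a numeric selector can be illustrated with a small dependency-free sketch. The function `render_numeric_selector` is a hypothetical helper written for this example, not part of sd4py; it only mirrors the float branch of `__str__` shown in the source above, where infinite bounds are omitted and finite bounds are printed with two decimals.

```python
def render_numeric_selector(attribute, lower, upper, include_lower, include_upper):
    """Hypothetical helper mirroring the float branch of PyNumericSelector.__str__."""
    parts = []
    # A lower bound of -inf means "no lower bound", so nothing is printed for it.
    if lower != float("-inf"):
        parts.append("{0:.2f} {1} ".format(lower, "<=" if include_lower else "<"))
    parts.append(attribute)
    # An upper bound of +inf means "no upper bound".
    if upper != float("inf"):
        parts.append(" {1} {0:.2f}".format(upper, "<=" if include_upper else "<"))
    return "".join(parts)


print(render_numeric_selector("Temperature", 2.5, 10.0, True, False))
# 2.50 <= Temperature < 10.00
print(render_numeric_selector("Pressure", float("-inf"), 1.0, False, True))
# Pressure <= 1.00
```

For datetime or timedelta bounds the source uses the same layout but formats the bounds with plain `{0}` instead of `{0:.2f}`.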

PyNumericSelector( attribute, lower_bound, upper_bound, include_lower_bound, include_upper_bound)
107    def __init__(
108        self,
109        attribute,
110        lower_bound,
111        upper_bound,
112        include_lower_bound,
113        include_upper_bound,
114    ):
115        self.attribute = attribute
116        self.lower_bound = lower_bound
117        self.upper_bound = upper_bound
118        self.include_lower_bound = include_lower_bound
119        self.include_upper_bound = include_upper_bound
attribute
lower_bound
upper_bound
include_lower_bound
include_upper_bound
class PyNominalSelector:
159class PyNominalSelector:
160    """
161    Represents a rule to select a subset of data, which combines with other selectors to form the subgroup/pattern definition.
162    It indicates an attribute-value pair through `attribute` and `value`.
163
164    Note that this is detached from the Java runtime, and so is a plain python object.
165    """
166
167    def __init__(self, attribute, value):
168        self.attribute = attribute
169        self.value = value
170
171    def __str__(self):
172        return "{0} = {1}".format(self.attribute, self.value)

Represents a rule to select a subset of data, which combines with other selectors to form the subgroup/pattern definition. It indicates an attribute-value pair through attribute and value.

Note that this is detached from the Java runtime, and so is a plain python object.

PyNominalSelector(attribute, value)
167    def __init__(self, attribute, value):
168        self.attribute = attribute
169        self.value = value
attribute
value
class PySubgroup:
175class PySubgroup:
176    """
177    Represents a subgroup in terms of its selectors, target evaluation, size and quality.
178
179    Note that this is detached from the Java runtime, and so is a plain python object.
180
181    Attributes
182    --------------
183    selectors: list
184        A list of `PyNumericSelector` and `PyNominalSelector` objects representing the rules constituting the subgroup/pattern.
185    target_evaluation: float
186        The value of the target variable for this subgroup (when evaluated against the dataset originally used for subgroup discovery).
187    size: int
188        The number of members in this subgroup (when evaluated against the dataset originally used for subgroup discovery).
189    quality: float
190        The quality of this subgroup (when applying the quality function to the dataset originally used for subgroup discovery).
191    target: string
192        The name of the target column.
193    target_value: object
194        The value of the target variable that counts as the 'positive' class.
195    """
196
197    def __init__(
198        self, selectors, target_evaluation, size, quality, target, target_value
199    ):
200        self.selectors = selectors
201        self.target_evaluation = target_evaluation
202        self.size = size
203        self.quality = quality
204        self.target = target
205        self.target_value = target_value
206
207    def __str__(self):
208        return " AND ".join([str(sel) for sel in self.selectors]).strip()
209
210    def get_indices(self, data):
211        """
212        Get the indices of rows that meet the subgroup definition for a specified dataset.
213
214        Parameters
215        ----------------
216        data: pandas DataFrame
217            The dataset in which to look for (the indices of) rows that match the subgroup definition.
218
219        Returns
220        -----------
221        index: pandas Index
222            The index identifying rows that meet the subgroup definition in the dataset provided.
223        """
224
225        logical_indices = np.ones(data.index.shape, dtype=bool)
226
227        for sel in self.selectors:
228            if isinstance(sel, PyNumericSelector):
229                if sel.include_lower_bound and sel.lower_bound != float("-inf"):
230                    logical_indices = logical_indices & (
231                        data[sel.attribute].values >= sel.lower_bound
232                    )
233                elif sel.lower_bound != float("-inf"):
234                    logical_indices = logical_indices & (
235                        data[sel.attribute].values > sel.lower_bound
236                    )
237                if sel.include_upper_bound and sel.upper_bound != float("inf"):
238                    logical_indices = logical_indices & (
239                        data[sel.attribute].values <= sel.upper_bound
240                    )
241                elif sel.upper_bound != float("inf"):
242                    logical_indices = logical_indices & (
243                        data[sel.attribute].values < sel.upper_bound
244                    )
245
246            if isinstance(sel, PyNominalSelector):
247                logical_indices = logical_indices & (
248                    data[sel.attribute].astype(str).values == sel.value
249                )
250
251        return data.index[logical_indices]
252
253    def get_rows(self, data):
254        """
255        Get the rows that meet the subgroup definition for a specified dataset.
256
257        Parameters
258        ----------------
259        data: pandas DataFrame
260            The dataset in which to look for rows that match the subgroup definition.
261
262        Returns
263        -----------
264        rows: pandas DataFrame
265            A selection of rows in the provided dataset that meet the subgroup definition.
266        """
267
268        return data.loc[self.get_indices(data)]

Represents a subgroup in terms of its selectors, target evaluation, size and quality.

Note that this is detached from the Java runtime, and so is a plain python object.

Attributes
  • selectors (list): A list of PyNumericSelector and PyNominalSelector objects representing the rules constituting the subgroup/pattern.
  • target_evaluation (float): The value of the target variable for this subgroup (when evaluated against the dataset originally used for subgroup discovery).
  • size (int): The number of members in this subgroup (when evaluated against the dataset originally used for subgroup discovery).
  • quality (float): The quality of this subgroup (when applying the quality function to the dataset originally used for subgroup discovery).
  • target (string): The name of the target column.
  • target_value (object): The value of the target variable that counts as the 'positive' class.
PySubgroup(selectors, target_evaluation, size, quality, target, target_value)
197    def __init__(
198        self, selectors, target_evaluation, size, quality, target, target_value
199    ):
200        self.selectors = selectors
201        self.target_evaluation = target_evaluation
202        self.size = size
203        self.quality = quality
204        self.target = target
205        self.target_value = target_value
selectors
target_evaluation
size
quality
target
target_value
def get_indices(self, data):
210    def get_indices(self, data):
211        """
212        Get the indices of rows that meet the subgroup definition for a specified dataset.
213
214        Parameters
215        ----------------
216        data: pandas DataFrame
217            The dataset in which to look for (the indices of) rows that match the subgroup definition.
218
219        Returns
220        -----------
221        index: pandas Index
222            The index identifying rows that meet the subgroup definition in the dataset provided.
223        """
224
225        logical_indices = np.ones(data.index.shape, dtype=bool)
226
227        for sel in self.selectors:
228            if isinstance(sel, PyNumericSelector):
229                if sel.include_lower_bound and sel.lower_bound != float("-inf"):
230                    logical_indices = logical_indices & (
231                        data[sel.attribute].values >= sel.lower_bound
232                    )
233                elif sel.lower_bound != float("-inf"):
234                    logical_indices = logical_indices & (
235                        data[sel.attribute].values > sel.lower_bound
236                    )
237                if sel.include_upper_bound and sel.upper_bound != float("inf"):
238                    logical_indices = logical_indices & (
239                        data[sel.attribute].values <= sel.upper_bound
240                    )
241                elif sel.upper_bound != float("inf"):
242                    logical_indices = logical_indices & (
243                        data[sel.attribute].values < sel.upper_bound
244                    )
245
246            if isinstance(sel, PyNominalSelector):
247                logical_indices = logical_indices & (
248                    data[sel.attribute].astype(str).values == sel.value
249                )
250
251        return data.index[logical_indices]

Get the indices of rows that meet the subgroup definition for a specified dataset.

Parameters
  • data (pandas DataFrame): The dataset in which to look for (the indices of) rows that match the subgroup definition.
Returns
  • index (pandas Index): The index identifying rows that meet the subgroup definition in the dataset provided.
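
The matching logic can be sketched without pandas, using plain dictionaries as rows. The names `NumericRule`, `NominalRule` and `matches` are hypothetical stand-ins introduced for this illustration; they are not sd4py classes. The sketch follows the source above: all selectors are combined with a logical AND, infinite bounds are skipped, and nominal values are compared after conversion to `str`.

```python
from collections import namedtuple

# Hypothetical stand-ins for PyNumericSelector and PyNominalSelector.
NumericRule = namedtuple(
    "NumericRule", "attribute lower upper include_lower include_upper"
)
NominalRule = namedtuple("NominalRule", "attribute value")


def matches(row, rules):
    """Return True if the row satisfies every rule (logical AND)."""
    for rule in rules:
        x = row[rule.attribute]
        if isinstance(rule, NumericRule):
            if rule.lower != float("-inf"):
                ok = x >= rule.lower if rule.include_lower else x > rule.lower
                if not ok:
                    return False
            if rule.upper != float("inf"):
                ok = x <= rule.upper if rule.include_upper else x < rule.upper
                if not ok:
                    return False
        else:
            # Nominal values are compared as strings, as in get_indices.
            if str(x) != rule.value:
                return False
    return True


rows = [
    {"Temperature": 31.0, "Valve": "open"},
    {"Temperature": 18.5, "Valve": "open"},
    {"Temperature": 35.2, "Valve": "closed"},
    {"Temperature": 40.0, "Valve": "open"},
]
rules = [
    NumericRule("Temperature", 30.0, float("inf"), True, False),
    NominalRule("Valve", "open"),
]
matching_positions = [i for i, row in enumerate(rows) if matches(row, rules)]
print(matching_positions)  # [0, 3]
```

Because of the string comparison, a nominal rule on a boolean column needs the value "True" or "False" rather than the Python booleans.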
def get_rows(self, data):
253    def get_rows(self, data):
254        """
255        Get the rows that meet the subgroup definition for a specified dataset.
256
257        Parameters
258        ----------------
259        data: pandas DataFrame
260            The dataset in which to look for rows that match the subgroup definition.
261
262        Returns
263        -----------
264        rows: pandas DataFrame
265            A selection of rows in the provided dataset that meet the subgroup definition.
266        """
267
268        return data.loc[self.get_indices(data)]

Get the rows that meet the subgroup definition for a specified dataset.

Parameters
  • data (pandas DataFrame): The dataset in which to look for rows that match the subgroup definition.
Returns
  • rows (pandas DataFrame): A selection of rows in the provided dataset that meet the subgroup definition.
class PySubgroupResults:
271class PySubgroupResults:
272    """
273    A collection of subgroups, returned as a result of performing subgroup discovery.
274
275    Note that this is detached from the Java runtime, and so is a plain python object.
276
277    Attributes
278    --------------
279    subgroups: list
280        A list of `PySubgroup` objects.
281    population_evaluation: float
282        The value of the target variable across the entire dataset originally used for subgroup discovery.
283    population_size: int
284        The number of rows in the dataset originally used for subgroup discovery.
285    target: string
286        The name of the target column.
287    target_value: object
288        The value of the target variable that counts as the 'positive' class.
289    """
290
291    def __init__(
292        self, subgroups, population_evaluation, population_size, target, target_value
293    ):
294        self.subgroups = subgroups
295        self.population_evaluation = population_evaluation
296        self.population_size = population_size
297        self.target = target
298        self.target_value = target_value
299
300    def __len__(self):
301        return len(self.subgroups)
302
303    def __iter__(self):
304        return self.subgroups.__iter__()
305
306    def __getitem__(self, selection):
307        if isinstance(selection, list) and selection and isinstance(selection[0], str):
308            subgroups = [sg for sg in self.subgroups if str(sg) in selection]
309
310            not_present = [
311                sel for sel in selection if sel not in [str(sg) for sg in subgroups]
312            ]
313
314            if len(not_present) > 0:
315                raise ValueError(
316                    "Indices {} not found in {}.".format(not_present, self)
317                )
318
319            else:
320                subgroups = [sg for sg in self.subgroups if str(sg) in selection]
321
322                out = copy.copy(self)
323                out.subgroups = subgroups
324
325                return out
326
327        if isinstance(selection, str):
328            for sg in self.subgroups:
329                if str(sg) == selection:
330                    return sg
331
332            raise ValueError("Index {} not found in {}.".format(selection, self))
333
334        if hasattr(selection, "__iter__"):
335            subgroups = [self.subgroups[i] for i in selection]
336
337            out = copy.copy(self)
338            out.subgroups = subgroups
339
340            return out
341
342        if isinstance(selection, slice):
343            out = copy.copy(self)
344            out.subgroups = self.subgroups.__getitem__(selection)
345
346            return out
347
348        return self.subgroups[selection]
349
350    def to_df(self):
351        """
352        Convert the subgroups included in this object into an easy-to-read pandas dataframe for viewing.
353
354        Returns
355        -----------
356        subgroups_df: pandas DataFrame
357            A table showing the subgroup definitions and associated important values like size, target value, and quality.
358        """
359
360        return pd.DataFrame(
361            [
362                {
363                    "pattern": str(sg),
364                    "target_evaluation": sg.target_evaluation,
365                    "size": sg.size,
366                    "quality": sg.quality,
367                }
368                for sg in self.subgroups
369            ]
370        )

A collection of subgroups, returned as a result of performing subgroup discovery.

Note that this is detached from the Java runtime, and so is a plain python object.

Attributes
  • subgroups (list): A list of PySubgroup objects.
  • population_evaluation (float): The value of the target variable across the entire dataset originally used for subgroup discovery.
  • population_size (int): The number of rows in the dataset originally used for subgroup discovery.
  • target (string): The name of the target column.
  • target_value (object): The value of the target variable that counts as the 'positive' class.
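
The class source above also supports several kinds of indexing. The following is a simplified sketch of that dispatch, assuming a hypothetical stand-in class `Results` whose subgroups are plain strings; it is not the sd4py implementation, only an illustration of the behaviour visible in `__getitem__`.

```python
import copy


class Results:
    """Hypothetical stand-in for PySubgroupResults; subgroups are plain strings."""

    def __init__(self, subgroups):
        self.subgroups = subgroups

    def _with(self, subgroups):
        # Shallow-copy the container and swap in the selected subgroups.
        out = copy.copy(self)
        out.subgroups = subgroups
        return out

    def __getitem__(self, selection):
        names = [str(sg) for sg in self.subgroups]
        if isinstance(selection, str):
            # A single pattern string returns the matching subgroup itself.
            if selection not in names:
                raise ValueError("Index {} not found.".format(selection))
            return self.subgroups[names.index(selection)]
        if isinstance(selection, slice):
            return self._with(self.subgroups[selection])
        if isinstance(selection, list) and selection and isinstance(selection[0], str):
            missing = [s for s in selection if s not in names]
            if missing:
                raise ValueError("Indices {} not found.".format(missing))
            # Filtering keeps the stored order, not the requested order.
            return self._with([sg for sg in self.subgroups if str(sg) in selection])
        if hasattr(selection, "__iter__"):
            # Any other iterable is treated as integer positions.
            return self._with([self.subgroups[i] for i in selection])
        return self.subgroups[selection]


results = Results(["A = x", "B = y", "C = z"])
print(results[0])                            # A = x
print(results["B = y"])                      # B = y
print(results[1:].subgroups)                 # ['B = y', 'C = z']
print(results[[2, 0]].subgroups)             # ['C = z', 'A = x']
print(results[["C = z", "A = x"]].subgroups) # ['A = x', 'C = z']
```

An integer or a single string returns one subgroup, while slices and lists return a new results container.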
PySubgroupResults( subgroups, population_evaluation, population_size, target, target_value)
291    def __init__(
292        self, subgroups, population_evaluation, population_size, target, target_value
293    ):
294        self.subgroups = subgroups
295        self.population_evaluation = population_evaluation
296        self.population_size = population_size
297        self.target = target
298        self.target_value = target_value
subgroups
population_evaluation
population_size
target
target_value
def to_df(self):
350    def to_df(self):
351        """
352        Convert the subgroups included in this object into an easy-to-read pandas dataframe for viewing.
353
354        Returns
355        -----------
356        subgroups_df: pandas DataFrame
357            A table showing the subgroup definitions and associated important values like size, target value, and quality.
358        """
359
360        return pd.DataFrame(
361            [
362                {
363                    "pattern": str(sg),
364                    "target_evaluation": sg.target_evaluation,
365                    "size": sg.size,
366                    "quality": sg.quality,
367                }
368                for sg in self.subgroups
369            ]
370        )

Convert the subgroups included in this object into an easy-to-read pandas dataframe for viewing.

Returns
  • subgroups_df (pandas DataFrame): A table showing the subgroup definitions and associated important values like size, target value, and quality.
def discover_subgroups( ontology, target, target_value=None, included_attributes=None, nbins=3, method='sdmap', qf='ps', k=20, minqual=0, minsize=0, mintp=0, max_selectors=3, ignore_defaults=False, filter_irrelevant=False, postfilter='', postfilter_param=0.0):
373def discover_subgroups(
374    ontology,
375    target,
376    target_value=None,
377    included_attributes=None,
378    #    discretise=True,
379    nbins=3,
380    method="sdmap",
381    qf="ps",
382    k=20,
383    minqual=0,
384    minsize=0,
385    mintp=0,
386    max_selectors=3,
387    ignore_defaults=False,
388    filter_irrelevant=False,
389    postfilter="",
390    postfilter_param=0.00,  ## Must be provided for most postfiltering types
391):
392    """
393    Search for interesting subgroups within a dataset.
394
395    Parameters
396    ----------------
397    ontology: pandas DataFrame or PyOntology object.
398        The data to use to perform subgroup discovery. Can be a pandas DataFrame, or a PyOntology object.
399    target: string
400        The name of the column to be used as the target.
401    target_value: object, optional
402        The value of the target variable that counts as the 'positive' class. Not needed for a numeric target, in which case the mean of the target variable will be used for subgroup discovery.
403    included_attributes: list, optional
404        A list of strings containing the names of columns to use. If not specified, all columns of the data will be used.
405    nbins: int, optional
406        The number of bins to use when discretising numeric variables. Default value is 3.
407    method: string, optional
408        Used to decide which algorithm to use. Must be one of Beam-Search `beam`, BSD `bsd`, SD-Map `sdmap`, SD-Map enabling internal disjunctions `sdmap-dis`. The default is `sdmap`.
409    qf: string, optional
410        Used to decide which quality function to use. Must be one of Adjusted Residuals `ares`, Binomial Test `bin`, Chi-Square Test `chi2`, Gain `gain`, Lift `lift`, Piatetsky-Shapiro `ps`, Relative Gain `relgain`, Weighted Relative Accuracy `wracc`, Wilcoxon-Mann-Whitney Rank `wmw`, Area-Under-Curve `auc`. The default is qf = `ps`.
411    k: int, optional
412        Maximum number (top-k) of patterns to discover, i.e., the best k patterns according to the selected quality function. The default is 20.
413    minqual: float, optional
414        The minimal quality. Defaults to 0, meaning there is no minimum.
415    minsize: int, optional
416        The minimum size of a subgroup in order for it to be included in the results. Defaults to 0, meaning there is no minimum.
417    mintp: int, optional
418        The minimum number of true positives in a subgroup (relevant for binary target concepts only). Defaults to 0, meaning there is no minimum.
419    max_selectors: int, optional
420        The maximum number of selectors/rules included in a subgroup. The default is 3.
421    ignore_defaults: bool, optional
422        If set to True, the values in the first row of data will be considered ‘default values’, and the same values will be ignored when searching for subgroups. Defaults to False.
423    filter_irrelevant: bool, optional
424        Whether irrelevant patterns are filtered out. Note that this negatively impacts performance. Defaults to False.
425    postfilter: string, optional
426        Which post-processing filter is applied.
427        Can be one of:
428         * Minimum Improvement (Global) `min_improve_global`, which checks the patterns against all possible generalisations;
429         * Minimum Improvement (Pattern Set) `min_improve_set`, checks the patterns against all their generalisations in the result set,
430         * Relevancy Filter `relevancy`, removes patterns that are strictly irrelevant,
431         * Significant Improvement (Global) `sig_improve_global`, removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all their possible generalizations,
432         * Significant Improvement (Set) `sig_improve_set`, removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all generalizations in the result set,
433         * Weighted Covering `weighted_covering`, performs weighted covering on the data in order to select a covering set of subgroups while reducing the overlap on the data.
434        By default, no postfilter is set, i.e., postfilter = "".
435    postfilter_param: float, optional
436        Provides the corresponding parameter value for the filtering chosen in postfilter. Must be provided for most postfiltering types.
437
438    Returns
439    -----------
440    subgroups: PySubgroupResults
441        The discovered subgroups.
442    """
443
444    if isinstance(ontology, PyOntology):
445        if target_value is not None:
446            raise ValueError(
447                "target_value cannot be provided when passing in a PyOntology instead of a pandas DataFrame."
448            )
449
450    else:
451        if target_value is not None:
452            target_bool = ontology[target] == target_value
453
454            ontology = ontology.drop(columns=target).join(target_bool)
455
456        elif ontology[target].dtype == "bool":
457            target_value = True
458
459        elif (
460            ontology[target].dtype == "object"
461            or ontology[target].dtype.name == "category"
462        ):
463            target_value = ontology[target].iloc[0]
464
465        ontology = PyOntology(
466            ontology.reset_index(drop=True)
467        )  ## Reset index because pandas seems to confuse itself when there is a MultiIndex!!!
468
469    ont = ontology.ontology
470
471    includedAttributes = java.util.HashSet()
472    if included_attributes:
473        includedAttributes.addAll(included_attributes)
474    else:
475        includedAttributes.addAll(ontology.column_names)
476
477    if target not in includedAttributes:
478        includedAttributes.add(target)
479
480    subgroups = SD4Py.discoverSubgroups(
481        ont,
482        JString(target),
483        includedAttributes,
484        JInt(nbins),
485        JString(method),
486        JString(qf),
487        JInt(k),
488        JDouble(minqual),
489        JInt(minsize),
490        JInt(mintp),
491        JInt(max_selectors),
492        JBoolean(ignore_defaults),
493        JBoolean(filter_irrelevant),
494        JString(postfilter),
495        JDouble(postfilter_param),
496    )
497
498    py_subgroups = []
499
500    population_value = None
501    population_size = None
502
503    for sg in subgroups.sortSubgroupsByQualityDescending():
504        py_selectors = []
505
506        for selector in sg.getDescription():
507            if isinstance(selector, DefaultSGSelector):
508                py_selectors.append(
509                    PyNominalSelector(
510                        str(selector.getAttribute().getId()),
511                        str(list(selector.getValues())[0]),
512                    )
513                )
514
515            if isinstance(selector, NumericSelector):
516                lb = selector.getLowerBound()
517                ub = selector.getUpperBound()
518
519                if str(selector.getAttribute().getId()) in ontology.datetime_columns:
520                    try:  ## convert back to datetime
521                        lb = (
522                            pd.to_datetime(int(lb))
523                            .tz_localize("GMT")
524                            .tz_convert(
525                                ontology.datetime_columns[
526                                    str(selector.getAttribute().getId())
527                                ]
528                            )
529                        )
530                    except OverflowError:  ## if 'inf'
531                        lb = float("-inf")
532
533                    try:
534                        ub = (
535                            pd.to_datetime(int(ub))
536                            .tz_localize("GMT")
537                            .tz_convert(
538                                ontology.datetime_columns[
539                                    str(selector.getAttribute().getId())
540                                ]
541                            )
542                        )
543                    except OverflowError:
544                        ub = float("inf")
545
546                elif str(selector.getAttribute().getId()) in ontology.timedelta_columns:
547                    try:  ## convert back to timedelta
548                        lb = pd.to_timedelta(int(lb))
549                    except OverflowError:  ## if 'inf'
550                        lb = float("-inf")
551
552                    try:
553                        ub = pd.to_timedelta(int(ub))
554                    except OverflowError:
555                        ub = float("inf")
556
557                else:
558                    lb = float(lb)
559                    ub = float(ub)
560
561                py_selectors.append(
562                    PyNumericSelector(
563                        str(selector.getAttribute().getId()),
564                        lb,
565                        ub,
566                        bool(selector.isIncludeLowerBound()),
567                        bool(selector.isIncludeUpperBound()),
568                    )
569                )
570
571        stats = sg.getStatistics()
572        population_value = float(stats.getTargetQuantityPopulation())
573        population_size = float(stats.getDefinedPopulationCount())
574
575        subgroup_value = float(stats.getTargetQuantitySG())
576        subgroup_size = float(stats.getSubgroupSize())
577        subgroup_quality = float(sg.getQuality())
578
579        py_subgroups.append(
580            PySubgroup(
581                py_selectors,
582                subgroup_value,
583                subgroup_size,
584                subgroup_quality,
585                target,
586                target_value,
587            )
588        )
589
590    return PySubgroupResults(
591        py_subgroups, population_value, population_size, target, target_value
592    )

Search for interesting subgroups within a dataset.

Parameters
  • ontology (pandas DataFrame or PyOntology object): The data to use to perform subgroup discovery. Can be a pandas DataFrame, or a PyOntology object.
  • target (string): The name of the column to be used as the target.
  • target_value (object, optional): The value of the target variable that counts as the 'positive' class. Not needed for a numeric target, in which case the mean of the target variable will be used for subgroup discovery.
  • included_attributes (list, optional): A list of strings containing the names of columns to use. If not specified, all columns of the data will be used.
  • nbins (int, optional): The number of bins to use when discretising numeric variables. Default value is 3.
  • method (string, optional): Used to decide which algorithm to use. Must be one of Beam-Search beam, BSD bsd, SD-Map sdmap, SD-Map enabling internal disjunctions sdmap-dis. The default is sdmap.
  • qf (string, optional): Used to decide which quality function to use. Must be one of Adjusted Residuals ares, Binomial Test bin, Chi-Square Test chi2, Gain gain, Lift lift, Piatetsky-Shapiro ps, Relative Gain relgain, Weighted Relative Accuracy wracc, Wilcoxon-Mann-Whitney Rank wmw, Area-Under-Curve auc. The default is qf = ps.
  • k (int, optional): Maximum number (top-k) of patterns to discover, i.e., the best k patterns according to the selected quality function. The default is 20.
  • minqual (float, optional): The minimal quality. Defaults to 0, meaning there is no minimum.
  • minsize (int, optional): The minimum size of a subgroup in order for it to be included in the results. Defaults to 0, meaning there is no minimum.
  • mintp (int, optional): The minimum number of true positives in a subgroup (relevant for binary target concepts only). Defaults to 0, meaning there is no minimum.
  • max_selectors (int, optional): The maximum number of selectors/rules included in a subgroup. The default is 3.
  • ignore_defaults (bool, optional): If set to True, the values in the first row of data will be considered ‘default values’, and the same values will be ignored when searching for subgroups. Defaults to False.
  • filter_irrelevant (bool, optional): Whether irrelevant patterns are filtered out. Note that this negatively impacts performance. Defaults to False.
  • postfilter (string, optional): Which post-processing filter is applied. Can be one of:
    • Minimum Improvement (Global) min_improve_global, which checks the patterns against all possible generalisations;
    • Minimum Improvement (Pattern Set) min_improve_set, checks the patterns against all their generalisations in the result set,
    • Relevancy Filter relevancy, removes patterns that are strictly irrelevant,
    • Significant Improvement (Global) sig_improve_global, removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all their possible generalizations,
    • Significant Improvement (Set) sig_improve_set, removes patterns that do not significantly improve (default 0.01 level, can be overridden with postfilter_param) with respect to all generalizations in the result set,
    • Weighted Covering weighted_covering, performs weighted covering on the data in order to select a covering set of subgroups while reducing the overlap on the data.
    By default, no postfilter is set, i.e., postfilter = "".
  • postfilter_param (float, optional): Provides the corresponding parameter value for the filtering chosen in postfilter. Must be provided for most postfiltering types.
Returns
  • subgroups (PySubgroupResults): The discovered subgroups.
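
To give a feel for how a quality function ranks subgroups, here is a short sketch of three of the measures named under qf, written from their common textbook definitions. This is an assumption made for illustration: the exact formulas and normalisation used inside VIKAMINE are not shown in this documentation and may differ. The function names are hypothetical; the arguments correspond to the size and target_evaluation of a PySubgroup and the population_size and population_evaluation of a PySubgroupResults.

```python
def piatetsky_shapiro(size, sg_value, pop_value):
    """Textbook Piatetsky-Shapiro quality: n * (p - p0)."""
    return size * (sg_value - pop_value)


def weighted_relative_accuracy(size, pop_size, sg_value, pop_value):
    """Textbook WRAcc: (n / N) * (p - p0)."""
    return (size / pop_size) * (sg_value - pop_value)


def lift(sg_value, pop_value):
    """Textbook lift: p / p0."""
    return sg_value / pop_value


# Illustrative numbers only: a subgroup of 50 rows out of 200, where the
# target share is 0.6 inside the subgroup and 0.4 in the whole population.
ps_quality = piatetsky_shapiro(50, 0.6, 0.4)
wracc_quality = weighted_relative_accuracy(50, 200, 0.6, 0.4)
lift_quality = lift(0.6, 0.4)
print(round(ps_quality, 3), round(wracc_quality, 3), round(lift_quality, 3))
```

Measures such as Piatetsky-Shapiro reward both a large deviation from the population and a large subgroup, whereas lift ignores size, which is one reason minsize can matter when using it.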