PRESENTATION OF MULTIVARIATE DATA

CATEGORICAL AND QUANTITATIVE

EXAMPLE
Old Versus Young
The effect of age and education on musical taste can be investigated by breaking the observations down into more homogeneous groups. The most obvious split is by age. There are 1300 older people and 1000 younger people. This is almost certainly a result of the way in which the sample was taken.

EXAMPLE (contd.)
Education Level
Within the old and young groups we can now find the proportions falling into each of the high and low education categories. The young group is clearly more highly educated than the old group.

EXAMPLE (contd.)
Summary
The result of our “analysis” is a series of tables. From these tables we can see:
1. There are slightly more old people than young people in the sampled group.
2. The younger people are more highly educated than the older ones.
3. The likelihood of listening to classical music depends on both age and education level.

MOSAIC PLOTS
• Mosaic plots give a graphical representation of these successive decompositions.
• Counts are represented by rectangles.
• At each stage of plot creation, the rectangles are split parallel to one of the two axes.

MOSAIC PLOTS
Creating Mosaic Plots
• In order to produce a mosaic plot it is necessary to have:
  – A contingency table containing the data.
  – A preferred ordering of the variables, with the “response” variable last.

MOSAIC PLOTS
Entering the Data
• To enter the data, we must settle on the order in which the values are entered.
• The values in an R array are ordered with the first subscript varying most quickly, the second subscript varying next most quickly, and so on.
• In the case of the music data we can take the first subscript to correspond to Age, the second to Education and the third to Listening.
• The steps are then (i) entering the data, (ii) shaping it as an array and (iii) labeling the extents.
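The entry order described above can be listed mechanically. The following is a minimal Python sketch, not the R code used for the slides; it only enumerates the cells in the order an R array would expect them. The Age and Education levels come from the example, while the Listening levels ("Yes", "No") and the helper name entry_order are assumptions made for illustration.

```python
from itertools import product

# Levels of each variable. "Yes"/"No" for Listening is an assumed labeling;
# the slides do not state the level names for that variable.
age = ["Old", "Young"]        # first subscript: varies most quickly
education = ["High", "Low"]   # second subscript: varies next most quickly
listening = ["Yes", "No"]     # third subscript: varies most slowly


def entry_order(*dims):
    """Return cell labels in array fill order (first subscript fastest)."""
    # itertools.product varies its LAST argument fastest, so pass the
    # dimensions in reverse and flip each resulting tuple back.
    return [tuple(reversed(cell)) for cell in product(*reversed(dims))]


order = entry_order(age, education, listening)
for position, cell in enumerate(order, start=1):
    print(position, cell)
```

Each printed position is the slot in the data vector where the count for that Age, Education and Listening combination would be typed.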
Order for Data Entry
[Figures not reproduced: the order for data entry and the resulting mosaic plots.]

TITANIC EXAMPLE
• The mortality rates aboard the Titanic were influenced strongly by age, sex, and passenger class. To compare the mortality rates of men and women using a mosaic plot, you would first divide the unit square according to the overall proportion of males and females.
• Roughly 35% of the passengers were female, so the first split of the mosaic plot is 35/65.
• Next, split each bar vertically according to the proportion who lived and died.
• Among females, 67% survived (coded as 1 on this plot) and 33% died (coded as 0), so the female bar shows a 67/33 split. Among males, only 17% survived, so the male bar shows a 17/83 split.
• Most implementations of the mosaic plot offer as a default a small margin around each cell to make the graph easier to read.
• This plot shows that males were the majority of the deaths and the minority of the survivors.
• As a general recommendation, variables that represent an exposure or treatment status should usually form the first split, and variables that represent an outcome should form the second split.
• A mosaic plot of the relationship between passenger class and mortality shows that the survival rate is best among first class passengers and worst among third class passengers.

JITTERED SCATTERPLOT
• The jitter function adds a small random quantity to the data coordinates, which separates the overplotted points.
• Jittering provides a clearer view of overlapping points.

Simpson's Paradox
• Sometimes averages can be misleading. Sometimes they don’t make any sense.
• Be careful when averaging different variables.
• Consider the Centerville sign:

Entering Centerville
Established   1793
Population    7943
Elevation      710
Average       3482

Simpson's Paradox - When Big Data Sets Go Bad
By Smita Skrivanek, Principal Statistician, MoreSteam.com LLC
• It's a well accepted rule of thumb that the larger the data set, the more reliable the conclusions drawn.
• Simpson's paradox, however, demonstrates that a great deal of care has to be taken when combining small data sets into a large one. Sometimes conclusions from the large data set are exactly the opposite of the conclusions from the smaller sets. Unfortunately, the conclusions from the large set are also usually wrong.

EXAMPLE
• You’re in charge of a study that compares how two weight loss techniques, diet and exercise, affect the weight loss of overweight patients.
• Overall, 240 patients participated in the study, with 120 assigned to a weight-loss diet and the remaining 120 assigned to a supervised exercise regimen.
• At the end of 30 days, you measured each group’s weight loss. The data showed that 70 dieters and 57 exercisers lost significant weight, representing 58% of the diet group and only 48% of the exercise group, a significant difference. So, should you conclude that diet is better than exercise?
• No, and this is why Simpson’s Paradox can be so tricky! When the data are stratified instead by the starting Body Mass Index (BMI) of the participants, a clearer picture emerges.
• When examined by BMI group, you can clearly see that the percentage of patients who lost weight in each BMI group was smaller among the dieters than among the exercisers. The overlooked factor was how the obese and severely obese individuals were distributed between the diet and exercise groups. Because those numbers were flipped, the overall percentages of successful weight loss are reversed (higher for the diet group).
• Simpson’s Paradox at Work: The percentage of patients who lost weight was higher for exercisers among both obese and severely obese patients, but when you aggregate the two groups, the dieters appear to do better.

Why Did This Happen?
• Two factors are at play here. First, there is an overlooked confounding variable (BMI), and second, a disproportionate allocation of BMI levels among the experimental (diet and exercise) groups. We do not know the reason for the disproportionate allocation, but we might guess that the patients somehow self-selected into the two groups.
[Figure: The proportions of dieters and exercisers in each BMI group]
[Figure: The proportions of weight loss and non-weight loss patients among the different subgroups]
• It is clear that a higher proportion of exercisers lost weight in each BMI group but that in the aggregated sample the proportions seem to be reversed.

EXAMPLE
• Suppose there are two pilots, Moe and Jill. Moe argues that he is the better pilot of the two, since he managed to land 83% of his last 120 flights on time compared with Jill’s 78%. Let’s look at the data more closely. Here are the results of their last 120 flights, broken down by the time of day they flew:

Pilot   Day                   Night                  Overall
Moe     90 out of 100 (90%)   10 out of 20 (50%)     100 out of 120 (83%)
Jill    19 out of 20 (95%)    75 out of 100 (75%)    94 out of 120 (78%)

• Look at the day and night separately. For day flights, Jill had a 95% on-time rate, and Moe only a 90% rate. At night, Jill was on time 75% of the time, and Moe only 50%. So, Moe is better “overall”, but Jill is better both during the day and at night.
• The problem here is the unfair averaging over different groups. Jill has mostly night flights, which are more difficult, so her average is heavily influenced by her night-time average. Moe, on the other hand, benefits from flying mostly during the day, when the on-time percentage is higher. With their different patterns of flying conditions, taking an overall average is misleading.
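The reversal in the pilots example can be checked by writing each overall rate as a weighted average of the day and night rates, with weights equal to the share of flights flown in each condition. The following is a minimal Python sketch using only the counts from the table above; the function name summarize is a label chosen for illustration.

```python
# (on-time landings, flights) per time of day, taken from the pilots table.
flights = {
    "Moe": {"Day": (90, 100), "Night": (10, 20)},
    "Jill": {"Day": (19, 20), "Night": (75, 100)},
}


def summarize(strata):
    """Return per-stratum rates, flight shares, and the overall rate."""
    total = sum(n_flights for _, n_flights in strata.values())
    rates = {k: on_time / n for k, (on_time, n) in strata.items()}
    shares = {k: n / total for k, (_, n) in strata.items()}
    # The overall rate is the share-weighted average of the stratum rates.
    overall = sum(shares[k] * rates[k] for k in strata)
    return rates, shares, overall


results = {pilot: summarize(strata) for pilot, strata in flights.items()}

for pilot, (rates, shares, overall) in results.items():
    print(pilot)
    for k in rates:
        print(f"  {k}: rate {rates[k]:.0%}, share of flights {shares[k]:.0%}")
    print(f"  Overall: {overall:.0%}")
```

Jill's rate is higher in both conditions, but most of her weight falls on the harder night flights while most of Moe's falls on the easier day flights, which is what pulls her overall figure below his.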
How to Avoid the Paradox
• To avoid spurious results, it is always good practice to examine whether the relationship in the aggregated dataset holds up in its subsets, especially when some groups are not as well represented as others in the data.
• Another way may be to weight the samples according to their sizes.
• Proper randomization also goes a long way in minimizing the effects of a lurking variable that might have been missed.
• Unfortunately, statistical analysis tools are just that: tools to help you organize and analyze the observed data. They cannot tell you anything about data that were not observed or not included in the analysis.
• So it is very important to involve a cross-functional team, and especially subject matter experts and practitioners, in the initial planning and selection of the variables to be measured. After the data are collected, the only way to try to avoid this pitfall is to examine meaningful subsets of the data, visually and otherwise.
• If you don’t have the option of planning the study but are given the data from a database and asked to “find what you can”, the lesson of Simpson's Paradox is to always look at the data at several levels of aggregation, as in the example above.

MULTIVARIATE CONTINUOUS DATA
• Matrix scatter plots
• Comparative boxplots
• Comparative violin plots
• 3D graphics

MATRIX SCATTER PLOTS
• A scatterplot matrix is an extension of the scatterplot to multidimensional data: a collection of scatterplots is arranged in a matrix so that the correlations among the attributes can be seen simultaneously.
• We can easily observe patterns in the relationships between pairs of attributes from the matrix.

MATRIX SCATTER PLOTS (contd.)
[Figure: A scatterplot matrix for 5-dimensional data of 400 automobiles]
• Automobiles are color-coded by the number of cylinders. Manufacturers can analyze the performance of the cars by number of cylinders to look for improvements, while customers can decide how many cylinders suit their needs.
LIMITATIONS OF SCATTER PLOTS
• There may be important patterns in higher dimensions that are barely recognizable in the pairwise plots.
• The display becomes chaotic when the number of points, that is, the number of data items, is too large.
• In that case brushing can be applied to address the problem. Brushing aids interpretation by highlighting a particular n-dimensional subspace in the visualization; that is, the points of interest are colored or highlighted in each scatterplot in the matrix.

Mixed Binary-Continuous Plots
• We might be interested in knowing:
  – How some binary variable Y covaries with some continuous variable X.
  – How some continuous variable Y differs for different values of some binary variable X.
• We could use a standard scatterplot, but a better approach is to add a smoothed representation of the relationship that describes the “density” of the data at various points on the X-axis, by adding a locally weighted regression (lowess) line. The lowess line represents something like the predicted value of the Y-axis variable at and around (conditional on) that value of X. The line gives an idea of the general “shape” of the data.
• When we have a continuous dependent variable and a binary independent variable, we need to adopt another approach.
• Let's examine whether there are any observable differences in adult HIV/AIDS prevalence rates between Saharan and sub-Saharan Africa. We can see that sub-Saharan Africa:
  – Has, on average, higher HIV/AIDS rates.
  – Has greater variation in infection rates.

Contour Plots, Surface Plots, and other 3-D Plots
• Suppose we want to look at Muslim population percentage, HIV rates, and literacy rates all at once.
• We could produce a contour plot (a representation of a three-dimensional graph in two dimensions).
• The contours tell us the level of HIV in the “regions” defined by the contour lines.
• In general, we see the highest HIV rates in countries with high literacy rates and low Muslim populations.
• We could also use a three-dimensional scatter plot, using R's scatterplot3d package.

LOTS OF VARIABLES
• Things become much harder when we have four or more variables. You might decide to dichotomize or discretize one or more of your variables.
• Suppose we want to know whether the relationship between the Muslim percentage of the population and HIV rates is moderated by the presence of civil wars and country size.
• We could divide the countries into big and small and produce a four-way scatterplot.
[Figure: four-way scatterplot of HIV/AIDS rate against Muslim % of the population, with panels by country size and civil war]
• 0 and 1 denote “little” and “big” countries respectively.
• “TRUE” and “FALSE” denote values for civil war.
• The negative HIV-Muslim population relationship holds for small countries but not large ones, and there are no appreciable differences between countries with internal conflict and those without.
