Visualize your data with Facets



Data is messy. It is often unbalanced, mislabeled, and filled with wacky values that throw off your analysis and machine learning training.


The first step in cleaning up your dataset is understanding where it needs cleaning. Today, I've got just the tool for the job.




Visualize your data


Understanding your data is the first step to cleaning up your dataset for machine learning. But it can be difficult to do, especially in any kind of generalized way.


An open source project from Google Research called Facets helps us visualize our data and slice it in all sorts of ways, which helps us see how our datasets are laid out.

By allowing us to detect where the data may not look the way we expect, Facets helps reduce mishaps down the road.

Let's see what it looks like. The team has a demo page on the web so you can try Facets from Chrome without installing anything. In addition, the Facets visualizations use Polymer web components backed by TypeScript code, so they can be easily embedded in Jupyter notebooks and web pages.


There are 2 parts to Facets: Facets Overview and Facets Dive. Let's look at each one in detail.





Facets Overview


Facets Overview gives a good overview of your dataset. In previous episodes, we saw how tools like pandas help us learn how our datasets are distributed. We can get a slightly upgraded view of this type of information using the Facets Overview tool.


It splits the columns of your data into rows of salient information, showing details such as the percentage missing and the minimum and maximum values, as well as statistics such as the mean, median, and standard deviation. It also has a column that shows the percentage of values that are zero, which helps catch cases where most of your values are zero.
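To get a feel for what those rows contain, here is a minimal pandas sketch that computes the same kinds of numbers by hand. It is not how Facets computes them internally; the function name overview, the column capital_gain, and the toy values are all made up for illustration.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary, loosely modeled on the numeric rows described above."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "missing_pct": numeric.isna().mean() * 100,   # share of missing values
        "min": numeric.min(),
        "max": numeric.max(),
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),
        # Share of rows that are exactly zero (missing values do not count as zero).
        "zeros_pct": (numeric == 0).mean() * 100,
    })


# Hypothetical toy data, only to show the shape of the result.
toy = pd.DataFrame({
    "age": [25, 38, None, 52],
    "capital_gain": [0, 0, 0, 5000],
})
summary = overview(toy)
print(summary)
```

A column where zeros_pct is very high is exactly the kind of thing you want to look at twice.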




The tool highlights high percentages of zeros, which is a good gut check for some columns.

You can also view the distribution of the data across the training and test sets for each feature of your dataset. This is a great way to double-check that the test set has a similar distribution to the training set.
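If you wanted to do that train/test comparison by hand, a rough sketch could look like the following. It bins one numeric feature with shared bin edges and compares the share of rows per bin in each split. The helper name compare_split and the toy ages are assumptions for illustration, not part of Facets.

```python
import numpy as np
import pandas as pd


def compare_split(train, test, bins=5):
    """Share of values per bin in each split, using the same bin edges for both."""
    train = pd.Series(train).dropna().to_numpy(dtype=float)
    test = pd.Series(test).dropna().to_numpy(dtype=float)
    # Shared edges so that the two histograms are directly comparable.
    edges = np.histogram_bin_edges(np.concatenate([train, test]), bins=bins)
    train_share = np.histogram(train, bins=edges)[0] / len(train)
    test_share = np.histogram(test, bins=edges)[0] / len(test)
    return pd.DataFrame({
        "bin_start": edges[:-1],
        "bin_end": edges[1:],
        "train_share": train_share,
        "test_share": test_share,
        "abs_diff": np.abs(train_share - test_share),
    })


# Hypothetical toy values for a single feature.
report = compare_split([20, 30, 40, 50, 60], [20, 25, 30, 35, 60], bins=4)
print(report)
```

Large values in abs_diff point at bins where the two splits disagree, which is the same thing the overlaid charts make visible at a glance.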

Yes, it is indeed best practice to do this level of analysis on your own anyway, but I have definitely forgotten to check all these aspects of every column of my data. This tool helps you not to miss this important step, and highlights anything anomalous.

Facets Dive


Now let's look at Facets Dive. This is where things get really fun. It gives you even more clarity on your dataset, and lets you zoom all the way in to see individual pieces of data!

You can facet the data by row and by column, across any of the features of your dataset. It's like shopping online for, say, shoes, and filtering by different categories like size, brand, and color. Let's look at an example of Facets Dive in action to make it more concrete.

The interface is divided into four main sections. The main area in the center is a zoomable display of your data. In the left panel, you can change the arrangement of your data, controlling faceting, positioning, and color with the various dropdown options. Directly below that is a legend for the display in the center. On the far right is a detailed view of a single row of data. You can click any row in the center visualization to see a detailed readout of that particular data point.




Now let's see how it all comes together.


To do this we will use the "Census dataset", a classic dataset extracted from the 1994 US Census by Barry Becker. The goal is to predict whether a household's annual income is above $50K based on various census statistics.


Facet by row


We first split the data by age range and color the points based on the target value. Here blue means less than $50K, and red means more than $50K.
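In table form, faceting by row and coloring by the target amounts to bucketing one feature and counting each target value per bucket. Here is a small pandas sketch of that idea; the column names, bucket edges, and rows are invented for illustration and are not taken from the real Census data.

```python
import pandas as pd

# Hypothetical rows; the real dataset has many more columns and records.
people = pd.DataFrame({
    "age": [19, 23, 35, 41, 47, 58, 62, 33],
    "income_over_50k": [False, False, True, False, True, True, False, True],
})

# Each age bucket plays the role of one row facet (bucket edges are assumptions).
people["age_bucket"] = pd.cut(people["age"], bins=[16, 30, 45, 60, 75])

# Count the two target values per bucket: the "blue" (False) and "red" (True) points.
counts = (
    people.groupby(["age_bucket", "income_over_50k"], observed=True)
    .size()
    .unstack(fill_value=0)
)
print(counts)
```

The visualization shows the same counts as colored dots, which makes the balance within each age range much easier to take in.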


Facet by column


Now we can look at another feature of the data to slice up the age breakdown further. Do different numbers of hours per week give different results within an age bracket? Let's find out by faceting the columns by hours per week.


We see a group of young people working only a few hours per week, which is likely the result of kids doing summer jobs. We can also see fewer and fewer people working in one of the hours-per-week segments as they get older, while another segment remains relatively stable in the middle years of the chart.
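The same two-way slicing can be sketched as a grid in pandas: bucket both features, then look at the share of people above $50K in each cell. Everything below (column names, bucket edges, and the rows themselves) is made up for illustration, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows; income_over_50k is 1 for "above $50K" and 0 otherwise.
people = pd.DataFrame({
    "age": [19, 22, 24, 35, 38, 41, 52, 55],
    "hours_per_week": [20, 20, 40, 40, 45, 60, 40, 20],
    "income_over_50k": [0, 0, 0, 1, 0, 1, 1, 0],
})

# Row facets by age, column facets by hours per week (bucket edges are assumptions).
people["age_bucket"] = pd.cut(people["age"], bins=[16, 30, 45, 60])
people["hours_bucket"] = pd.cut(people["hours_per_week"], bins=[0, 30, 50, 100])

# Share of people above $50K in each (age, hours) cell; empty cells show as NaN.
grid = (
    people.groupby(["age_bucket", "hours_bucket"], observed=True)["income_over_50k"]
    .mean()
    .unstack()
)
print(grid)
```

Reading across a row of this grid answers the question above for one age bracket at a time.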


Positioning


But this still doesn't quite give us a good picture of what we're looking for. Let's try changing the positioning of the plot to get a more detailed view. We will switch positioning to 'scatter', and facet by age only. I'm also going to select "hours per week" as my vertical sort order, to make it easier to see the hours worked in different age groups. Now we can see that hours per week rise and then fall again as we move across the charts from one side to the other.


You should keep exploring the data to see what trends and relationships you can find. For example, you might facet by country of origin, which shows that the data is heavily skewed. That tells you that you may want to collect more data points to build a more balanced dataset.
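A quick way to put a number on that kind of skew is to look at the share of rows per category. This is only a sketch with placeholder labels (country_a and so on are not values from the real dataset):

```python
import pandas as pd

# Hypothetical category column with one dominant value.
country = pd.Series(["country_a"] * 8 + ["country_b", "country_c"])

# Share of rows per category, largest first.
shares = country.value_counts(normalize=True)
print(shares)

# A single category holding most of the rows is a sign of a skewed dataset.
top_label, top_share = shares.index[0], shares.iloc[0]
print(f"{top_label} accounts for {top_share:.0%} of the rows")
```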


Load your data


If this looks interesting, you're probably wondering how to load your own dataset into Facets. Here, you have 2 options.


You can either use the web interface to upload your data and play with it in the browser, or you can install the library as a Jupyter notebook extension, using the instructions on the project's GitHub page.
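For the notebook route, the general idea is to hand your records to a facets-dive web component as JSON. The sketch below only builds the HTML string. The import URL is my assumption based on the project README and may have changed, and the helper name build_dive_html and the toy DataFrame are made up, so please check the GitHub instructions before relying on it.

```python
import pandas as pd

# Assumed location of the prebuilt Facets bundle; verify it against the project's README.
FACETS_IMPORT_URL = (
    "https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/"
    "facets-dist/facets-jupyter.html"
)


def build_dive_html(df: pd.DataFrame, height: int = 600) -> str:
    """Return HTML that loads Facets Dive and passes it the rows of df as JSON."""
    records_json = df.to_json(orient="records")  # one JSON object per row
    return (
        f'<link rel="import" href="{FACETS_IMPORT_URL}">\n'
        f'<facets-dive id="elem" height="{height}"></facets-dive>\n'
        "<script>\n"
        f"  document.querySelector('#elem').data = {records_json};\n"
        "</script>"
    )


# Hypothetical toy data.
html = build_dive_html(pd.DataFrame({"age": [25], "hours_per_week": [40]}))
print(html)

# In a Jupyter notebook you would then render it with:
#   from IPython.display import HTML, display
#   display(HTML(html))
```

Depending on your browser, the README may also ask you to load a web components polyfill first, so treat this as a starting point rather than a complete recipe.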


Facets is useful for peering into your dataset and seeing the relationships between the various features, as well as whether there are missing or unexpected values in your dataset.



Get Facets on GitHub: https://goo.gl/Xi8dTu

Play with Facets in the browser: https://goo.gl/fFLCEV
