WEKA Steps for Loading Data: Difference between revisions
Latest revision as of 20:39, 26 September 2022
Steps for Loading Data into WEKA
The ARFF format consists of three parts: @RELATION, @ATTRIBUTE, and @DATA.
- @RELATION name
- @ATTRIBUTE descriptor_name data_type (numeric, nominal …)
- @DATA: numbers (integer or real) or strings
General rules for ARFF files can be found here: https://www.cs.waikato.ac.nz/ml/weka/arff.html
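To make the three parts concrete, here is a minimal Python sketch that assembles a small ARFF text. The function name `minimal_arff`, the relation name, the descriptor names, and the values are all made up for illustration; the layout (blank lines around the @ATTRIBUTE block) follows the steps described further down and is only for readability.

```python
def minimal_arff(relation, attributes, rows):
    """Return ARFF text with the three parts: @RELATION, @ATTRIBUTE, @DATA.

    attributes: list of (name, data_type) pairs, e.g. ("MolWt", "NUMERIC").
    rows: list of lists of values, one inner list per data line.
    """
    lines = [f"@RELATION {relation}", ""]
    for name, data_type in attributes:
        lines.append(f"@ATTRIBUTE {name} {data_type}")
    lines += ["", "@DATA"]
    for row in rows:
        # One comma-separated line per row, in the same order as the attributes.
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


example = minimal_arff(
    "demo",                                        # hypothetical relation name
    [("MolWt", "NUMERIC"), ("LogP", "NUMERIC")],   # hypothetical descriptors
    [[180.16, 1.19], [46.07, -0.31]],              # made-up values
)
print(example)
```

The sketch assumes names without spaces or special characters; names that contain spaces would need quoting in a real ARFF file.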
In Excel:
- File 1: this file has three columns: “@ATTRIBUTE”, the descriptors’ names, and “NUMERIC”.
- Open the txt data file in Excel. Make sure the file-type filter in the Open dialog is set to “All Files” so that the txt file is listed.
Figure 1, Opening your data file
- When the Text Import Wizard appears, choose Delimited and click Next (step 1), check the Tab box (step 2), and click Finish. (Leave the other settings at their defaults unless you need to change them.)
Figure 2, Step 1
Figure 3, Step 2
- Create three blank columns.
- Copy the descriptors’ names (normally the first row of your data file) and paste them vertically into the second column using the “Transpose” paste option.
- Make sure the cell format is Text before the next step.
Figure 4, Cell format
- Fill the first and third columns with “@ATTRIBUTE” and “NUMERIC”, respectively, for as many rows as the second column has.
- Example:
Figure 5, the three columns
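For readers who prefer a script to the Excel transpose step, the same three-column result can be sketched in Python. This is only an illustration: the function name `attribute_lines` and the descriptor names are hypothetical, and it assumes the header row is tab-delimited, every descriptor is numeric, and names contain no spaces.

```python
def attribute_lines(header_line, data_type="NUMERIC"):
    """Turn a tab-delimited header row into one @ATTRIBUTE line per descriptor."""
    names = [name.strip() for name in header_line.rstrip("\n").split("\t")]
    # Skip empty cells, e.g. from a trailing tab at the end of the header row.
    return [f"@ATTRIBUTE {name} {data_type}" for name in names if name]


header = "MolWt\tLogP\tTPSA\n"   # hypothetical descriptor names
for line in attribute_lines(header):
    print(line)
```

Each printed line corresponds to one row of File 1: the keyword, the descriptor name, and the data type.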
- File 2: this file holds the @DATA part of the data (numbers only).
- Keep only the numbers needed and delete everything else.
- Save it in CSV (comma delimited) format. (Click YES when Excel asks “Some features in your workbook … Do you want to keep using that format?”)
Figure 6, Save as CSV
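The File 2 step can also be sketched with Python's standard `csv` module. The function name `data_rows_as_csv` and the sample values are made up; the sketch assumes the original data file is tab-delimited and that its first row holds the descriptor names.

```python
import csv
import io


def data_rows_as_csv(tab_text, skip_header=True):
    """Return the data rows of tab-delimited text as comma-separated lines."""
    reader = csv.reader(io.StringIO(tab_text), delimiter="\t")
    rows = [row for row in reader if row]        # drop empty lines
    if skip_header:
        rows = rows[1:]                          # first row holds descriptor names
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


sample = "MolWt\tLogP\n180.16\t1.19\n46.07\t-0.31\n"   # made-up values
print(data_rows_as_csv(sample))
```

The result contains only the numbers, one comma-separated line per row, which is what the @DATA section expects.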
In Notepad++:
- File A:
- In the first row, type @RELATION followed by a name for the relation, separated by a single space.
- (For aesthetics: leave the second row blank.)
- Copy the three columns from File 1 in Excel and paste them starting at row 3.
- Example:
Figure 7, start of File A
- File A (cont):
- Depending on your data, the last @ATTRIBUTE row may need to be the response variable, i.e. the thing you are trying to predict. (A CATEGORICAL response may require this; a NUMERIC one may not.)
- After all the @ATTRIBUTE rows have been pasted, leave one row blank (for aesthetics).
- After the blank row, type “@DATA”.
- (For aesthetics: leave the row after “@DATA” blank.)
- Example:
Figure 8, middle of File A
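If the response variable is not already the last column, both the @ATTRIBUTE rows and the data columns have to be reordered together. A minimal Python sketch of that reordering follows; the function name `move_response_last` and the column names and values are hypothetical.

```python
def move_response_last(names, rows, response):
    """Reorder columns so the response variable is the last one.

    names: list of descriptor names, in file order.
    rows: list of lists of values, each in the same order as names.
    response: name of the column you are trying to predict.
    """
    idx = names.index(response)                  # raises ValueError if absent
    order = [i for i in range(len(names)) if i != idx] + [idx]
    new_names = [names[i] for i in order]
    new_rows = [[row[i] for i in order] for row in rows]
    return new_names, new_rows


names = ["Activity", "MolWt", "LogP"]            # hypothetical columns
rows = [["1", "180.16", "1.19"], ["0", "46.07", "-0.31"]]
new_names, new_rows = move_response_last(names, rows, "Activity")
print(new_names)
print(new_rows)
```

Applying the same column order to the names and to every data row keeps the @ATTRIBUTE lines and the @DATA values aligned.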
- File B:
- Open File 2 in Notepad++ (you should see that the data are separated by commas).
- Example:
Figure 9, File 2 open with Notepad++
- Copy all the data and paste it into File A, below “@DATA”.
- Back in File A:
- Save the file in “arff” format (by adding “.arff” to the end of the file name).
The file is ready to open and run in WEKA (yay).
PS: (1) A “%” in a descriptor’s name after “@ATTRIBUTE”, such as “%N” or “%O”, causes an error. Everything after “%” is treated as a comment, so WEKA reads the line as missing the descriptor name and data type. Removing the “%” avoids the error. (2) Also make sure the number of descriptors matches the number of values in each line of the “@DATA” section. (If you have 10 @ATTRIBUTE lines, each line in the @DATA part in Notepad++ should contain 10 numbers.)
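Both checks in the PS can be automated before opening the file in WEKA. The Python sketch below is only a rough pre-check, not a full ARFF parser: the function name `check_arff` is made up, and it assumes plain comma-separated data lines with no quoted strings that contain commas and no sparse-format rows.

```python
def check_arff(text):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    n_attributes = 0
    in_data = False
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue                              # blank rows are only cosmetic
        upper = stripped.upper()
        if in_data:
            # Check (2): each data line needs one value per @ATTRIBUTE line.
            n_values = len(stripped.split(","))
            if n_values != n_attributes:
                problems.append(
                    f"line {number}: {n_values} values, expected {n_attributes}"
                )
        elif upper.startswith("@ATTRIBUTE"):
            n_attributes += 1
            # Check (1): a '%' on an @ATTRIBUTE line starts a comment.
            if "%" in stripped:
                problems.append(f"line {number}: '%' in attribute line")
        elif upper.startswith("@DATA"):
            in_data = True
    return problems


good = "@RELATION demo\n\n@ATTRIBUTE a NUMERIC\n@ATTRIBUTE b NUMERIC\n\n@DATA\n1,2\n3,4\n"
bad = "@RELATION demo\n@ATTRIBUTE %N NUMERIC\n@ATTRIBUTE b NUMERIC\n@DATA\n1,2,3\n"
print(check_arff(good))
print(check_arff(bad))
```

An empty list for a file means neither problem was detected; any reported line numbers refer to lines in the ARFF text itself.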