WEKA Steps for Loading Data
Steps for Loading Data into WEKA
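An ARFF file consists of three parts: @RELATION, @ATTRIBUTE and @DATA.
@RELATION, then a space, then the relation name
@ATTRIBUTE, then a space, the descriptor name, another space, and the data type (numeric, nominal …), one line per descriptor
@DATA, followed on the next lines by the values (integers, real numbers or strings), with the values in each row separated by commas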
The general rules for ARFF files can be found here:
https://www.cs.waikato.ac.nz/ml/weka/arff.html
(Search for “ARFF File Format”.)
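To make the three parts concrete, here is a minimal Python sketch that prints a tiny ARFF file. The relation name, descriptor names and values are made-up placeholders, not taken from any real data set.

```python
# Minimal sketch of the three ARFF parts as plain text.
# relation, attributes and rows are hypothetical placeholder values.
relation = "example_relation"
attributes = [("descriptor1", "NUMERIC"), ("descriptor2", "NUMERIC")]
rows = [[1.5, 2], [3.25, 4]]

lines = [f"@RELATION {relation}", ""]          # part 1: relation name
lines += [f"@ATTRIBUTE {name} {dtype}"          # part 2: one line per descriptor
          for name, dtype in attributes]
lines += ["", "@DATA"]                          # part 3: data marker
lines += [",".join(str(v) for v in row)         # one comma-separated row per instance
          for row in rows]

arff_text = "\n".join(lines)
print(arff_text)
```

Each data row has as many values as there are @ATTRIBUTE lines, in the same order.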
In Excel:
- File 1: this file has three columns: “@ATTRIBUTE”, the descriptors’ names, and “NUMERIC”.
- Open the txt data file in Excel. Make sure the file-type filter is set to “All Files” so that the txt file is listed.
Figure 1, Opening your data file
- When the Text Import Wizard appears, choose Delimited and click Next (step 1), check the Tab box (step 2) and click Finish. (Leave the rest at the defaults unless you need to change them.)
Figure 2, Step 1
Figure 3, Step 2
- Create three blank columns.
- Copy the descriptors’ names (normally they are in the first row of your data file) and paste them vertically into the second column using the “Transpose” paste option.
- Make sure the cell format is Text before the next step.
Figure 4, Cell format
- Fill the first column with “@ATTRIBUTE” and the third column with “NUMERIC”, using the same number of rows as the second column.
- Example:
Figure 5, the three columns
- File 2: here we separate out the @DATA part (numbers only) of the data.
- Keep only the numbers needed and delete everything else.
- Save it in CSV (comma delimited) format. (Click Yes when Excel asks “Some features in your workbook … Do you want to keep using that format?”)
Figure 6, Save as CSV
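The Excel steps above can also be scripted. The following Python sketch is one possible way to do it, assuming a tab-delimited table whose first row holds the descriptor names and whose remaining rows hold only numbers. The function name split_table and the sample descriptors (MW, LogP, Activity) are hypothetical.

```python
import csv
import io

def split_table(tab_text):
    """Return (attribute_lines, data_lines) from tab-delimited text.

    Assumes the first row holds the descriptor names and every other
    row holds only numeric values.
    """
    reader = csv.reader(io.StringIO(tab_text), delimiter="\t")
    rows = [row for row in reader if row]  # skip empty lines
    names, data = rows[0], rows[1:]
    # File 1: "@ATTRIBUTE", descriptor name, "NUMERIC"
    attribute_lines = [f"@ATTRIBUTE {name.strip()} NUMERIC" for name in names]
    # File 2: comma-delimited numbers only
    data_lines = [",".join(cell.strip() for cell in row) for row in data]
    return attribute_lines, data_lines

# Hypothetical sample data standing in for the txt file.
sample = "MW\tLogP\tActivity\n180.2\t1.3\t5.1\n94.1\t0.4\t6.7\n"
attribute_lines, data_lines = split_table(sample)
print("\n".join(attribute_lines))
print("\n".join(data_lines))
```

To use it on a real file, read the file's contents into a string and pass that string to split_table.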
In Notepad++:
- File A:
- In the first row, type @RELATION followed by a name for the relation, separated by a single space.
- (For aesthetics: leave the second row blank.)
- Copy the three columns from File 1 in Excel and paste them starting at row 3.
- Example:
Figure 7, start of the File A
- Depending on your data, the last @ATTRIBUTE row will need to be the response variable, i.e. the thing you are trying to predict. (A categorical (nominal) response may require this; a numeric one may not.)
- After all the @ATTRIBUTE information has been pasted, leave one row blank (for aesthetics).
- After the blank row, type “@DATA”.
- (For aesthetics: leave the row after “@DATA” blank.)
- Example:
Figure 8, middle of File A
Figure 9, File 2 open with Notepad++
- Copy all the data from File 2 and paste it into File A, below “@DATA”.
- Back in File A:
- Save the file in ARFF format (by adding “.arff” to the end of the file name).
The file is ready to open and run in WEKA (yay).
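As a rough alternative to assembling File A by hand, the Python sketch below moves the response variable to the last column and writes the result with an “.arff” extension. It assumes every attribute is NUMERIC. The function names, the descriptor names and the output path are hypothetical.

```python
import os
import tempfile

def move_response_last(names, rows, response):
    """Reorder columns so the response variable is the last attribute."""
    idx = names.index(response)
    order = [i for i in range(len(names)) if i != idx] + [idx]
    return [names[i] for i in order], [[row[i] for i in order] for row in rows]

def write_arff(path, relation, names, rows):
    """Write header and data; every attribute is declared NUMERIC (an assumption)."""
    lines = [f"@RELATION {relation}", ""]
    lines += [f"@ATTRIBUTE {name} NUMERIC" for name in names]
    lines += ["", "@DATA"]
    lines += [",".join(str(v) for v in row) for row in rows]
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")

# Hypothetical descriptors; "Activity" plays the role of the response variable.
names = ["Activity", "MW", "LogP"]
rows = [[5.1, 180.2, 1.3], [6.7, 94.1, 0.4]]
names, rows = move_response_last(names, rows, "Activity")

# Written to the system temp directory so the sketch does not touch real data.
out_path = os.path.join(tempfile.gettempdir(), "example.arff")
write_arff(out_path, "example_relation", names, rows)
```

The data columns are reordered together with the names, so each value stays under its own descriptor.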
PS:
(1) A “%” in a descriptor’s name after “@ATTRIBUTE” (for example “%N” or “%O”) will give you an error. Everything after “%” is treated as a comment, so WEKA interprets the descriptor name and data type as missing. Simply removing the “%” avoids the error.
(2) Also make sure the number of descriptors matches the number of values in each line of the “@DATA” section. (If you have 10 @ATTRIBUTE lines, each line in the @DATA part should contain 10 numbers.)
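Both pitfalls can be checked before opening the file in WEKA. The Python sketch below is a rough check, not a full ARFF parser: it assumes plain comma-separated data with no quoted values, and the function name check_arff and the sample text are hypothetical.

```python
def check_arff(text):
    """Return a list of problems found in ARFF text (empty if none).

    Covers only two checks: '%' inside an @ATTRIBUTE line, and data rows
    whose value count differs from the number of @ATTRIBUTE lines.
    """
    problems = []
    n_attributes = 0
    in_data = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank line or whole-line comment
        upper = line.upper()
        if not in_data:
            if upper.startswith("@ATTRIBUTE"):
                n_attributes += 1
                if "%" in line:
                    problems.append(f"line {number}: '%' in attribute declaration")
            elif upper.startswith("@DATA"):
                in_data = True
        else:
            n_values = len(line.split(","))
            if n_values != n_attributes:
                problems.append(
                    f"line {number}: {n_values} values, expected {n_attributes}"
                )
    return problems

# Hypothetical sample containing both mistakes.
sample = "\n".join([
    "@RELATION demo",
    "@ATTRIBUTE %N NUMERIC",
    "@ATTRIBUTE MW NUMERIC",
    "@DATA",
    "1.0,2.0",
    "3.0",
])
problems = check_arff(sample)
for problem in problems:
    print(problem)
```

An empty list means neither problem was found; it does not guarantee that WEKA will accept the file.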