How to convert array data into Weka instances - (Mostly) Programming Notes

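Previously, in the 2010-08-31 entry of 仕事関連のメモ, I described an implementation of the k-means method using Weka. However, preparing an ARFF file becomes tedious as the number of dimensions grows, and it is odd in the first place to hand data generated by an existing program to Weka by way of a file. So I wrote a program that builds the set of instances directly from data read with plain Java.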

// Turn array data into Weka instances and apply the k-means method to them.
/*
iris.txt has the following format (the iris data with the attribute names removed):
----------------------------
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
....................
----------------------------
 */

// [Compile & run]
// javac -cp weka.jar WekaKmeans.java
// java -cp weka.jar:. WekaKmeans

import java.io.*;
import weka.core.*;
import weka.clusterers.*;

public class WekaKmeans {
    public static void main(String[] args) throws Exception {

	// Set up the attribute names
	FastVector attributes = new FastVector();
	attributes.addElement(new Attribute("sepallength"));
	attributes.addElement(new Attribute("sepalwidth"));
	attributes.addElement(new Attribute("petallength"));
	attributes.addElement(new Attribute("petalwidth"));

	// Define the (initially empty) set of instances
	Instances dataClusterer = new Instances("iris", attributes, 0);

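	/// Read data from iris.txt and build the set of instances (from here) ///
	// try-with-resources closes the reader even if parsing fails
	try (BufferedReader fin = new BufferedReader(
					new FileReader("iris.txt"))) {
	    String s;
	    while((s = fin.readLine()) != null){
		String[] ss = s.split(",");

		// Put the values for one instance into an array ...
		// (the fifth field, the species label, is not used for clustering)
		double[] data = new double[4];
		for(int i = 0; i < 4; i++) data[i] = Double.parseDouble(ss[i]);

		// ... create an instance from it ...
		// (Weka 3.6 API; in Weka 3.7 and later Instance is an interface,
		//  so use new DenseInstance(1.0, data) instead)
		Instance instance = new Instance(1.0, data);

		// ... and add (register) it to the set of instances
		dataClusterer.add(instance);
	    }
	} catch(Exception e){
	    System.err.println("Error while reading data: " + e);
	    System.exit(1);
	}
	/// Read data from iris.txt and build the set of instances (to here) ///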

	
	////////// Build the clusterer (clustering model) //////////
	// Use the k-means method as the clustering technique
	SimpleKMeans clusterer = new SimpleKMeans();
	clusterer.setNumClusters(3);	// set the number of clusters
	clusterer.buildClusterer(dataClusterer); // build the clusterer from the training data

	////////// Evaluate the clusterer (clustering model) //////////
	ClusterEvaluation eval = new ClusterEvaluation();
	eval.setClusterer(clusterer);  // set the clusterer to evaluate
	eval.evaluateClusterer(dataClusterer);  // use the original data as the test data

	////////// Print the evaluation results (summary) for the clusterer //////////
	System.out.println(eval.clusterResultsToString());

	// Get the cluster number assigned to each instance by the clustering
	double[] assignment = eval.getClusterAssignments();
	for(int i = 0; i < assignment.length; i++){
	    System.out.print((int)assignment[i] + " ");
	}
    }
}

The output is shown below. It is the same as when the data is loaded from an ARFF file (as one would expect).

kMeans
======

Number of iterations: 6
Within cluster sum of squared errors: 6.998114004826762
Missing values globally replaced with mean/mode

Cluster centroids:
                           Cluster#
Attribute      Full Data          0          1          2
                   (150)       (61)       (50)       (39)
=========================================================
sepallength       5.8433     5.8885      5.006     6.8462
sepalwidth         3.054     2.7377      3.418     3.0821
petallength       3.7587     4.3967      1.464     5.7026
petalwidth        1.1987      1.418      0.244     2.0795


Clustered Instances

0       61 ( 41%)
1       50 ( 33%)
2       39 ( 26%)

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2 2 2 0 2 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 0 2 2 2 0 2 2 2 0 2 2 2 0 2 2 0
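
The reading loop in the program above can also be pulled out into a small helper that does not depend on Weka, which makes the parsing step easy to check on its own. What follows is only a minimal sketch: the class name `CsvRows`, the method `parse`, and the decision to skip blank lines and reject short lines are assumptions of this example, not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the original program): converts lines such as
// "5.1,3.5,1.4,0.2,Iris-setosa" into rows of numeric values.
public class CsvRows {

    // Parses the first numAttributes comma-separated fields of each line as doubles.
    // Any fields after that (for example the species label) are ignored.
    // Blank lines are skipped; that is a choice made for this sketch.
    public static double[][] parse(List<String> lines, int numAttributes) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;
            String[] fields = line.split(",");
            if (fields.length < numAttributes) {
                throw new IllegalArgumentException("too few fields: " + line);
            }
            double[] row = new double[numAttributes];
            for (int i = 0; i < numAttributes; i++) {
                row[i] = Double.parseDouble(fields[i].trim());
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("5.1,3.5,1.4,0.2,Iris-setosa");
        lines.add("4.9,3.0,1.4,0.2,Iris-setosa");
        double[][] rows = parse(lines, 4);
        System.out.println(rows.length + " rows, first value " + rows[0][0]);
    }
}
```

Each row returned by `parse` can then be wrapped with `new Instance(1.0, row)` and added to `dataClusterer`, exactly as in the loop of the original program.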