Coding ------------------------------------------------------------------------ 2012-12-15 09:37:48 Graham Williams Replace all usage of cat() with message() or warning() so that messages can easily be supporessed by users. There are not many places I do this. Settings Menu File Encoding ------------------------------------------------------------------------ 2010-04-27 06:50:06 Graham Williams For Japanese (and probably others) only one type of encoding is active when loading a CSV file, and UTF-8 is the default. If you try loading a file with a different encoding (SJIS?) then you should manually change the default encoding: crv$csv.encoding="" We need an option in the Settings to change the default encoding. It would be nicer if we could check the file encoding through the R code, and then do it properly without the user knowing. Alternatively, we add it to the File Open dialog. ------------------------------------------------------------------------ Use Cairo Device Graphics ------------------------------------------------------------------------ 2011-09-11 09:48:07 Graham Williams Under Windows, using Cairo Device, plots, using multiple plots on the one page through layout() lose some elements from the plot. For now we default to not using Cairo Device on Windows. ------------------------------------------------------------------------ Help Menu ------------------------------------------------------------------------ 2010-02-13 08:11:18 Graham Williams, Tony Nolan Add a simple R command syntax example to each Rattle help text. ------------------------------------------------------------------------ Data Tab ------------------------------------------------------------------------ 2010-05-30 09:44:52 Graham Williams If the loaded CSV file has observations with the target as missing, then treat those observatoins as a scoring dataset, not to be included in the modelling at all. ------------------------------------------------------------------------ Variable selection Variable type ---------------------------------------------------------------------- 2010-02-13 08:02:46 Graham Williams, Robert Williams Turn the Data Type column into a combo box to allow variable types to change in situ. ---------------------------------------------------------------------- Variable role ---------------------------------------------------------------------- 2010-02-13 08:02:46 Graham Williams, Tony Nolan Replace the radio buttons with a single combo box. This will allow further choices for the roles. ---------------------------------------------------------------------- Explore Tab Distributions option ---------------------------------------------------------------------- 2008-01-31 12:14:03 Graham Williams Move to using ggplot2. The plots might be more standardised, modern, simpler to generate, and much more powerfull than plot and lattice. ---------------------------------------------------------------------- Interactive Tab GGobi Option ---------------------------------------------------------------------- 2010-10-07 07:08:36 Graham Williams Add radio buttons to select whether to display clusters, model results, etc: e.g;, g <- ggobi(crs$dataset) clustering <- hclust(dist(iris[,1:4]), method="average") glyph_colour(g[1]) <- cuttree(clustering, 3) ---------------------------------------------------------------------- Test Tab ------------------------------------------------------------------------ 2010-04-03 17:43:34 Graham Williams, Luke Lake Add "Categoric Tests: O Cross Tab" to then allow two categorics to be selected and then do a cross tab of them. I.e., replace Sample 1 and Sample 2 combo boxes with list of Categorics when this option is chosen. ------------------------------------------------------------------------ Transform Tab ------------------------------------------------------------------------ 2010-10-03 18:39:16 Graham Williams, Tony Nolan Add a Subset option to allow crs$datset to be subsetted in various ways and then to use the subset for all model building etc. ------------------------------------------------------------------------ Model Tab Fast and large regression and SVM ------------------------------------------------------------------------ 2010-05-31 06:31:29 Graham Williams New package available on CRAN: LiblineaR is a wrapper around the liblinear C/C++ library for machine learning: L1- or L2-regularized logistic regression, L1- or L2-regularized L2-loss support vector classification, L2-regularized L1-loss support vector classification and multi-class support vector classification. It is fast. ------------------------------------------------------------------------ Evaluate Tab Score option ------------------------------------------------------------------------ 2010-05-30 11:05:47 Graham Williams Bug: If the input data has many NAs for the target variable, then when scoring them (which is probably what is really wanted for these observations), the scores all come out as NA. ------------------------------------------------------------------------ Log Tab ------------------------------------------------------------------------ 2009-11-05 05:08:38 Graham Williams Periodically save the log tab to a backup file. ------------------------------------------------------------------------ Unsorted list: ------------------------------------------------------------------------ 091103 Nolan's group wihtout a categoric chose (Tony Nolan) 090929 Rattle View data with many columns has issues (Liyin Xue) 090928 Add in a tutorial mode - cf log tab (Tony Nolan) 090928 Extend printRFRules for regression (Michael Fogliani) 081222 Add CSV Separator and Header information to filechooser instead of tab 090512 Clustering via fuzzy c-means (e1071) algorithm (Iliya Georgiev) 090109 Add RVM (as cf SVM) (Gabriel Ibarra) 090109 Allow a SQL select to choose the data you want (Eberhard W. Lisse) 090109 Support PostgreSQL (RdbiPgSQL) and MySQL (RMySQL) (Eberhard W. Lisse) 090105 Use F1 for getting Help (Gabor Grothendieck) 090105 Implement Log as separate window (Gabor Grothendieck) 090124 Explore using igraph (Rado) 090304 Use Weights in correlation and cluster analysis (Al Jones) 080131 Add earth to Rattle (William Bert Craytor) 080528 Migrate rattle from libglade to GtkBuilder http://www.micahcarrick.com/12-24-2007/gtk-glade-tutorial-part-1.html http://www.micahcarrick.com/05-30-2008/gtk-builder-libglade-faq.html 080416 Log transform +ve/-ve (Linday Clayton) 080416 C-stat eval for interval target (Linda Clayton) 080729 Use rsq.rpart, plotcp, to plot errors with Errors button on Tree option 080716 Risk Charts for regression by using rank 080711 Replace Distributions with laticist/playwith (Felix Andrews) 080629 Capture transformations in PMML 080627 Allow categorics with hclust 080627 Auto normalise variables in clustering? 080627 Move to using pam since can handle missing and categorics 080626 Add a transpose option (Tony Nolan) 080531 Migrate from libglade to GtkBuilder 080529 Parallel Coords plot with target to colour lines (ggobi) 080529 Add ADTrees to Rattle 080526 Add transforms: log(n+1); asin(sqrt(x)); sqrt(x); x^(1/3); 1/x (Norrie) 080525 Use "norm" for multivariate normalisation: prelim.norm, em.nrom, imp.norm - stratify by target values 080520 Bug: Ada trees show -1/1 rather than 0/1? 0080518 Remove correlated + near constant columns in transform tab See http://cheminfo.informatics.indiana.edu/~rguha/code/R/ 080518 Implement knn regression as in http://cheminfo.informatics.indiana.edu/~rguha/code/R/ 080512 Fix scoring for multiple class classification 080513 Introduce Ensemble scoring to add scores together (see NetFlix). 080427 Fix Draw of decision tree for Regression 080512 Notice $ amounts and , in thousands and treat as numeric (Tony Nolan) 080521 Use dprep's ce.impute to impute: mean, median, knn 080521 Use dprep functionality for feature selection +++ 080521 Use dprep for outlier detection 080513 Add a R command console where paradigms were (Tony Nolan) 080517 Add a macros menu with canned R templates (Tony Nolan) 080511 Move to using guiDo from plotAndPlayGTK 080511 Move to using playwith from plotAndPlayGTK for plots 080511 Add support for scagnostics package 080504 Allow risk charts without a RISK identified 080502 Put PMML.COPYRIGHT into .Rattle 080502 Add report info to PMML export of GLM 080501 Fix cluster export to CSV to export the whole dataset 080429 Add a SQLiteDF data option 080427 Report pmml.nnet error to Zementis 080427 Fix Draw of decision tree for Regression 080427 On Score be sure to include the commands in log. 080426 Add "Transform" check button on Evaluate tab (active when Score and CSV) that will perform same transformations on the scoring dataset. (Tony) 080427 Fix export of PMML for rpart 080421 For assoc, if ID is not unique, set Baskets to checked 080422 Complete read.pmml function for rpart. 080419 Accumlate the transforms and apply to new CSV score file (Tom Neice) 080418 Change plots to be tabbed plots with second tab being parameters that lay behind the model whose performance is being plotted. 080418 Load rattle script files (.ras) for data, score, test: 080418 Allow txt score file (perhaps automatic if training data is txt) 080408 Transform Tab, "Filter" textbox for rows to keep 080330 After Transform refreshes list, keep same selections? 080318 Migrate all global variables into the crv$ structure - avoid warnings 080315 Add a Last Dataset/Project option (Tony Nolan) 080305 Implement silhouette plot for kmeans clusters 080203 Ed Cox: Add row selection to Select Tab (perhaps using "subset"). 080203 Ed Cox: Bug when using ODBC data source multiple times. 060921 Pareto, zipf (ginger.hpl.hp.com/shl/papers/ranking/ranking.html) (Stu) 070106 Transform: Implement mean/median from knn imputations (Daniel Medri) 071215 Display rsq.rpart - the r squared fit by splits. 071216 Add date conversion to Remap: as.Date(COLUMN, format="%d-%m-%Y") 071216 Rename binning to not conflict with sm's binning. 071205 Add in Randy Goebel's association rule visualisation (AI07) 071125 Use relaimpo to calculate relative importance of variables: calc.relimp(Adjusted ~ ., audit[,c(2,7,9:10,13)], type = c("lmg", "last", "first", "betasq", "pratt") ) 070422 Disable drag and drop from the file open dialogs (Ray Lindsay). 070406 For the cairo plot window, introduce keyboard shortcuts. 070325 Add a Defaults button to each model builder. 070130 Explore->Distr check buttons train/test/csv (csv from Evaluate tab) 070330 Add selection of clustering: diana, agnus, fanny, etc. 070421 Ensure the Help menu is complete. 061019 Numerous graphics extensions (Geir Nilsen) 070401 Add a File Type combo box to graphics save. 070514 FileChooser to also show file sizes (Tony Nolan) 070324 Implement dendrogram zoom using rgl (see plot3d example)? 070321 For the View textview, keep first row (headers) static as "edit()" does 070321 Implement own Edit textview to edit dataset, instead of ugly "edit()" 060703 Kmeans: If a target exists, generate a hot spots display. 070305 Add a "best k" finder using hclust (Daniele Medri) 060821 Explore->Distributions: Clear = clear all radio buttons 070317 Keep the cwd info local to Rattle rather than globally changing. 070317 Add arffname as a parameter to the rattle function call. 070109 Evaluate: add compare distributions radio button (Ray Lindsay) 070311 Replace radio buttons with icons (Graham Williams) 070319 Make the Explore:Summary:Find button do "Find Next" and remove Next. 070317 Make the Explore:Summary:Find search case insensitive. 070121 Each textview have default explanatory text 061001 Choice of Kernel for SVMs. 070319 Introduce a progress bar for all activities (Tony Nolan) 070114 Add textmining.R to Rattle 060723 Simplify RF rule sets 070220 Auto update feature(Tony Nolan) 070201 Investigate party for rattle (Daniel Medri) 070201 Invetigate rmpi for rattle (Daniel Medri) 070105 Implement knn model builder (Yale) 060001 Implement bagged boosted stumps (Jeremy Barnes) 070316 Implement "expert" settings for each model (Liyin Xue) ======================================================================== Description: Implement nnet (Yale) Proposer: DateProposed: Status: Effort: Required: Completed: 061230 This is not so satraightforward. I want this for two class prediction, so need a framework for building a model and then generating the predictions. Currently the following is not quite it.... crs$nnet <- nnet(as.factor(Adjusted) ~ ., data=......, size=1, na.action=na.omit)predict(crs$nnet, crs$dataset[crs$sample,c(1:6,9,12)], type="class") ======================================================================== Description: Optionally add cluster statistics. Proposer: Graham Williams DateProposed: 20 Oct 2006 Status: Done for kmeans. Do for hclust Effort: 2 hours Required: This would be something like the output of library(fpc) cluster.stats(dist(crs$dataset[crs$sample,c(2,7,9:10,12)]), crs$kmeans$cluster) Can it be pretty printed for the textview? Also, move to using buttons to generate plots and additional statistics. Currently to do - in hclust, enable the number of clusters and the statstics and discriminant plot. Whenever the number of clusters changes, update the internal crs variable to record this. And the buttons then work from this??? Perhaps. ======================================================================== Decsription: Improve hclust options Proposer: Stuart Hamilton DateProposed: 12 Oct 2006 Status: Suggested Effort: 2 hours Required: Explore hclust and then cutree options for hclust. Then also include the cluster.stats summary. crs$hclust <- hclust(dist(crs$dataset[crs$sample,c(2,7,9:10,12)]), "ave") cluster.stats(dist(crs$dataset[crs$sample,c(2,7,9:10,12)]), cutree(crs$hclust,10)) plotcluster(crs$dataset[crs$sample,c(2,7,9:10,12)], cutree(crs$hclust,10)) ======================================================================== Decsription: Move to using kmeansCBI etc for all clustering. Proposer: Graham Williams DateProposed: 20 Oct 2006 Status: Suggested Effort: 4 hours Required: Seems to provide a uniform interface and uniform output. But each model produces a lot more output. Do I really want to have all of that included in the model? crs will get large. ======================================================================== Decsription: Multiple evaluation plots for kmeans Proposer: Graham Williams DateProposed: 21 October 2006 Status: Suggested Effort: 2 hours Required: Completed: plotcluster in fpc can plot many different plots. Perhaps plot several in the one plot. Need to determine which ones would be useful. ######################################################################## IMPUTATION ------------------------------------------------------------------------ mix has mi.inference but did not look like it did what I was wanting. ------------------------------------------------------------------------ EMV has knn for imputation from knn. Works only on numeric matrix. ------------------------------------------------------------------------ mice looks promising but get errors: mp <- mice(crs$dataset) iter imp variable 1 1 Employment Employment.d.1 Occupation Occupation.d.1 AccountsError in nnet.default(X, Y, w, mask = mask, size = 0, skip = TRUE, softmax = TRUE, : too many (1683) weights However, md.patter(crs$datset) gives very useful information: ID Age Education Marital Income Sex Deductions Hours Adjustment Adjusted 1859 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 40 1 1 1 1 1 1 1 1 1 1 97 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 Accounts Employment Occupation 1859 1 1 1 0 1 1 1 0 1 40 0 1 1 1 97 1 0 0 2 3 0 0 0 3 43 100 101 244 Here, 1859 entities have no missing values (i.e, a row of 1's and the final column is 0). 1 entity has a missing value for Occupation only (i.e, a row of 1's except under occupation where it is 0, and the final column is 1, indicating just 1 missing value). 40 entities have missing values for Accounts only, 90 have missing values for both Employment and Occupation, and 3 entities for all of Account, Employment and Occupation. This would be very useful information to include in Rattle. ------------------------------------------------------------------------ Amelia II http://gking.harvard.edu/amelia/ Is this really primarily for time series? An initial test indicates not much from this. Could not get categoricals working. output <- amelia(data=ds[,c(1,2,7,9,10,12,13)], m=1) ======================================================================== OLD Implement Marco's Transform/Cut/Factorise code? Version 2.2 General: Export functionality Data Tab: Access SQLite database Evaluate Tab: Implement distribution comparison Generate plots to compare distributions Model Type = "All" for all built models or use check boxes? Model tab Use tune to build the best model: rpart, randomForest, svm TODO General: Export each model to PMML, SQL, and standalone R script. Add an argument to rattle to load the csv file. Add a button to run an R script? (Tony Nolan) Add R commands to status line as they are run (Tony Nolan) Use rgl for sophisticated graphics? Data: Cluster: Multiple kmeans and plot to find best k (Enrico) Models: Neural Networks using netalg Bayesian using bayesm Plots: All plots into a Cairo tabbed window! ======================================================================== MAJOR Export button For Explore tab's Distribution option, export bunch of plots. Export of Model should generate PMML of the model. Export of Explore->Plot should generate PDF or PNGs of the plots. Can hier cluster rescale images (Linda) ======================================================================== POSSIBLE Data tab's ODBC option Allow username and password to be specified Allow a filename (abc.mdb or efg.xls) and use appropriate calls crs$odbc <- odbcConnectExcel("h:/sample.xls") crs$odbc <- odbcConnectAccess("h:/sample.mdb") Memory Have a look at biglm for lm on blocks of data - can do 23 million rows ======================================================================== Rewrite based on example code from Rcmdr as in QCUGUI Identify where Warnings occur and use suppressWarnings(code...) When execute, for each categorical, if num levels differ, report in TV Automatically identify bad boxplots (outliers) and do log. Graphs to compare distributions of the same variables between datasets Write article for Newsletter prcomp: NAs in data cause failure - temp soln is use na.omit(ds) and warn. For each textview, add initial text to explain the point (c.f. prcomp) ## 2-way tables work (heatmap over cells chi-square significant) (Rohan) ## Add t/z test and chi2 test in explore tab (Tony) ## Some solution to the blotches in the hierarchical clusters (Linda) ## Automatically add in an optimal line to risk plot? ## Add a Cancel/Stop button (Michael) ## Kernel optimisation ala KXEN. ## Model visualisation for SVMs ## Better report generation for direct inclusion in reports! ## 3D visualisation of SVM in R (Tony) ## Outputs - generate a report in Word (Tony) ## For each library command, check it is available, as with rpart. ## Need some checks to ensure the weight ends up numeric. ## Add a threshold value (e.g. 0.2) for use in the Evaluation charts (Ed Cox) ## For HierCor (and others) convert factors to integers (Tony) ## Allow Variable types to be changed e.g., factor to integer? (Tony) ## Save file with ID, Decision (use threshold), Probability (Ed Cox, Fuchun) ## Evaluate: Add OOB evaluations (Eugene) ## MODEL RF: Should sampsize be a proportion ## Seriation plot: add a mouse hover to provide extra info (Tony) ## Seriation plot: Click to plot all below the click (Tony) ## Seriation plot: cf Tony's terrain plot and explain ## Hierachical clustering to highlight targets (hot spots). ## Get AUC as a measure for ROC curves. ## Add seriation/colour denrogram www.lirmm.fr/~caraux/PermutMatrix/ (Stuart) ## Variables: For ratios SAS/EM quantile transform is good - Rohan ## Add a new Score tab to score a new dataset and save the results. (Fuchun) ## Explain lift better in the lift tooltip. ## For multiple targets, choose a different logistic regression method. ## Print any text view via a Print button. ## Add MARS (mda package). ## CLUSTER HIERCOR: Set text size based on # of branches? ## EVALUATE: Give error if Variables changed but model not regenerated. ## MISC: When changing tabs, warn if modified but not executed. ## BUG: Sample 10% not noticed until ENTER???? ## MISC: Run quit.rattle() when root window destroyed. ## DATA: Get header, rows, sep, and na.strings working ## MODEL RPART: Use print.rules.cmd string for log and eval ## MODEL: GBM: Implement and document new options ## EVALUATE RISK: If VARIABLE empty, then just do no Risk curve. ## EVALUATE: Error when test set has factor levels different to train ## EVALUATE: Use the text areas to describe the charts. ## EVALUATE: When Sampling, be sure to apply to only the sample. ## EVALUATE: Add grids to all plots ## INIT: Make filename the first active widget. TAB then to Execute. ## MODEL: Multiple models then optimise with best? ## Help -> Search ... brings up a text entry box ## Help -> Function ... brings up a text entry box ## DATA: In the file chooser replace Open with Select? ## Seriation plot: set point size to tiny (at least can save as PDF to zoom). ## VARIABLES: Could colour the text in Variable column according to role. ## MODEL note when sample has changed, model not updated and we eveluate ## SAMPLE: Use sample.split of caTools to partition test/train on ## target to maintain outcome ratio (and thus more likely to ## ensure there are samples from each class, unlike any old ## sample). This does not get around the problem with sampling ## not including all possible levels for the input variables so it ## won't help where the testset does not have all possible ## levels. RandomForest will still complain on the predict.
Shop at Amazon