Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

readdlm dictionary parsing default behavior #13

Open
jakebolewski opened this issue Jun 10, 2016 · 19 comments
Open

readdlm dictionary parsing default behavior #13

jakebolewski opened this issue Jun 10, 2016 · 19 comments
Labels
bug Something isn't working collections I/O

Comments

@jakebolewski
Copy link

jakebolewski commented Jun 10, 2016

julia> english = readdlm(IOBuffer(readstring("/usr/share/dict/words")),  ' ')
235886×1 Array{Any,2}:
 "A"
 "a"
 "aa"
 "aal"
 "aalii"
 "aam"
 "Aani"
 "aardvark"
 "aardwolf"
 "Aaron"
 "Aaronic"
 "Aaronical"
 "Aaronite"
 "Aaronitic"
 "Aaru"
 "Ab"
 "aba"
 "Ababdeh"
 "Ababua"
 "abac"
 "abaca"
 "abacate"
 "abacay"
 
 "zymoscope"
 "zymosimeter"
 "zymosis"
 "zymosterol"
 "zymosthenic"
 "zymotechnic"
 "zymotechnical"
 "zymotechnics"
 "zymotechny"
 "zymotic"
 "zymotically"
 "zymotize"
 "zymotoxic"
 "zymurgy"
 "Zyrenian"
 "Zyrian"
 "Zyryan"
 "zythem"
 "Zythia"
 "zythum"
 "Zyzomys"
 "Zyzzogeton"

julia> english_lower = map(lowercase, english);
ERROR: MethodError: no method matching lowercase(::Bool)
 in collect_to!(::Array{String,2}, ::Base.Generator{Array{Any,2},Base.#lowercase}, ::Int64, ::Int64) at ./array.jl:267
 in _collect(::Array{Any,2}, ::Base.Generator{Array{Any,2},Base.#lowercase}, ::Base.EltypeUnknown, ::Base.HasShape) at ./array.jl:249
 in map(::Function, ::Array{Any,2}) at ./abstractarray.jl:1131
 in eval(::Module, ::Any) at ./boot.jl:225
 in macro expansion at ./REPL.jl:92 [inlined]
 in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46

julia> Set(typeof(x) for x in english)
Set(Any[Bool,SubString{String},Float64])

julia> english_str = map(strip, split(readstring("/usr/share/dict/words")));

julia> foreach(english_str) do x
       try
       parse(Float64, x)
       println(x)
       end
       end
infinity
Nan
nan
@andreasnoack
Copy link
Member

What did you expect from can_parse_float? The words that can be parsed as Float64s are already Float64s

julia> filter(t -> isa(t, Float64), english)
3-element Array{Any,1}:
 Inf
 NaN
 NaN

and the second argument in parse should be a string.

julia> parse(Float64, NaN)
ERROR: MethodError: no method matching parse(::Type{Float64}, ::Float64)
Closest candidates are:
  parse{T<:AbstractFloat}(::Type{T<:AbstractFloat}, ::AbstractString)
 in eval(::Module, ::Any) at ./boot.jl:225
 in macro expansion at ./REPL.jl:92 [inlined]
 in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46

@jakebolewski
Copy link
Author

Yes, I am not thinking this early. I updated the original comment. I feel that this should return a list of strings by default (would be what I would expect) but this is debatable because Inf / NaN / true / false are in the dictionary.

@jakebolewski jakebolewski changed the title readdlm cannot parse the dictionary correctly by default readdlm dictionary parsing default behavior Jun 10, 2016
@malmaud
Copy link

malmaud commented Jun 10, 2016

I think every other DLM reader I've encountered will first check if a column is entirely boolean/numeric, and if not will just return strings for each cell. Returning a hybrid of both is weird.

@StefanKarpinski
Copy link
Member

Solution: delete readdlm.

@malmaud
Copy link

malmaud commented Jun 10, 2016

@ViralBShah
Copy link
Contributor

@tanmaykm said he is working on a better readdlm. And deleting it is not a solution, unless you have something better to offer.

@StefanKarpinski
Copy link
Member

How is deleting it not a solution? Both DataFrames and CSV.jl already offer better solutions to this.

@tkelman
Copy link

tkelman commented Jun 10, 2016

Even if the code that's currently in base is going to be improved, there's not much reason for it to remain in this repository and coupled to the release schedule of the core language and compiler.

@StefanKarpinski
Copy link
Member

💯 CSV parsing is complicated enough that it should be moved out of the base language.

@ViralBShah
Copy link
Contributor

ViralBShah commented Jun 10, 2016

Yes, CSV parsing is complicated, but dlm parsing is not. The current Julia repo has a bunch of other things that make things convenient for users of such systems - and I don't see why reading simple delimited files is the thing that has to be booted. Why do I have to install a package to load a simple file?

@StefanKarpinski
Copy link
Member

This implementation is a nightmare and it's awful to maintain when we change things in Base.

@StefanKarpinski
Copy link
Member

The fact that we have readdlm in base leads to also having readcsv in base, which leads to users, naïvely (the fools!) actually trying to use it to read CSV files, only to be told, "don't do that, you should use DataFrames or CSV". That is a ludicrous situation.

@JeffBezanson
Copy link
Contributor

JeffBezanson commented Jun 10, 2016 via email

@ViralBShah
Copy link
Contributor

readcsv is only 2 lines, and I am fine with removing them to avoid the confusion.

readcsv(io; opts...)          = readdlm(io, ','; opts...)
readcsv(io, T::Type; opts...) = readdlm(io, ',', T; opts...)

@ViralBShah
Copy link
Contributor

The implementation is a nightmare for good reason. I think it will simplify over time as the compiler gets better, and by being strict about the files that it accepts.

@nalimilan
Copy link
Member

+1 for removing. Apart from threads by people complaining about poor performance and lack of options of this function, Julia has been criticized for having a too big standard library. readdlm is a typical case as it's not even in a submodule which could eventually become a separate package.

@JeffBezanson
Copy link
Contributor

Yes, I think we should deprecate readcsv right away, and plan to move readdlm into the "standard library" when that process happens, hopefully early in the 0.6 cycle.

@jakebolewski
Copy link
Author

jakebolewski commented Jun 15, 2016

reading in simple homogeneous (typed) text files is really handy esp. for testing so delegating this job to CSV.jl or similar package seems like overkill. I think that readdlm should be similar to Numpy's loadtxt. It handles the simple cases reliably by specifying the input type up front and punting on any input that is hard. I would rather have something simple that works reliably than something super complicated and magical that does not.

@StefanKarpinski
Copy link
Member

I would be fine with having something simple, fast and reliable in Base; that is not what we have atm.

@JeffBezanson JeffBezanson added the bug Something isn't working label Aug 23, 2017
@ViralBShah ViralBShah transferred this issue from JuliaLang/julia Apr 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working collections I/O
Projects
None yet
Development

No branches or pull requests

9 participants