where clause causing extra iterations of preceding clauses

Jun 11, 2014 at 9:09 PM
I am experiencing a case in which a "where" clause is causing repetition of the clauses that precede it. I am opening a series of files and filtering on failure-to-open and putting the results in a container. A simplified, standalone version of the code follows:
#include <stdio.h>
#include <iostream>
#include <vector>
#include "cpplinq.hpp"

using namespace cpplinq;

int main( int argc, char* argv[] )
{
   char *pFilepaths[] = { "C:\\temp\\a.txt", "C:\\temp\\b.txt" };
   std::vector<std::pair<std::string,FILE*>> vFiles = from_array( pFilepaths )
      >> select( []( const std::string& filePath ) -> std::pair<std::string, FILE*>
         { 
            FILE* pFile = fopen( filePath.c_str(), "w" );
            std::cout << "select: path = " << filePath << ", filePtr = " << pFile << std::endl;
            return std::make_pair( filePath, pFile ); 
         })
      >> where( []( const std::pair<std::string, FILE*>& filePathPtrPair )->bool
         { 
            std::cout << "where: filePath = " << filePathPtrPair.first << ", filePtr = " << filePathPtrPair.second << std::endl;
            if( filePathPtrPair.second == nullptr )
            { 
               std::cout << "Unable to open file ID: " << filePathPtrPair.first << std::endl;
               return false;
            }
            return true;
         })
      >> to_vector();

   for( auto iFilePathPtrPair = vFiles.begin(); iFilePathPtrPair != vFiles.end(); ++iFilePathPtrPair )
   {
      std::cout << "vector: " << iFilePathPtrPair->first << ", filePtr = " << iFilePathPtrPair->second << std::endl;
   }
   return 0;
}
The output is as follows:
select: path = C:\temp\a.txt, filePtr = 00000000579CA6F0
where: filePath = C:\temp\a.txt, filePtr = 00000000579CA6F0
select: path = C:\temp\a.txt, filePtr = 00000000579CA720 ** second time thru on file a.txt
select: path = C:\temp\b.txt, filePtr = 00000000579CA750
where: filePath = C:\temp\b.txt, filePtr = 00000000579CA750
select: path = C:\temp\b.txt, filePtr = 00000000579CA780 ** second time thru on file b.txt
vector: C:\temp\a.txt, filePtr = 00000000579CA720
vector: C:\temp\b.txt, filePtr = 00000000579CA780

The bolded lines are the ones that don't seem appropriate. It seems to be repeating the select clause after performing the where clause. By looking at what it put in the vector, I can see that it is putting the results of the extra iterations, not the original ones.

For now, I have disabled the "where" clause and when I get some time I will dig into the back-end operation to see if I can figure out why this is happening.

Any feedback into why this is happening would be greatly appreciated.

Thanks,

-Todd
Coordinator
Jun 11, 2014 at 10:12 PM
Hi.

That does seem like the wrong behavior. Thanks for pointing it out. I will look into it when I have the time.

Mårten
Jun 13, 2014 at 2:16 PM
Edited Jun 13, 2014 at 2:41 PM
I've stepped through the code and I now understand what is happening. The select predicate (via select_range::front) is getting called by both where_range::next and where_range::front as shown in the following call history:
to_vector_builder.build()
   where_range.next()
      select_range.next()
         from_range.next()
      select_range.front()
         from_range.front()
         select_range.predicate() // select predicate called the first time
      where_range.predicate()
   where_range.front()
      select_range.front()
          from_range.front()
          select_range.predicate() // select predicate called the second time
Unfortunately, knowledge of what is going on isn't immediately translating to knowledge of what to do about it. The fundamental issue is that "where" needs to run the select to get the input to its own predicate, but it doesn't store the result, forcing the aggregation to recompute it. I would expect all of the predicate-based filtering and partitioning functions (where, skip_while, take_while, etc.) to have the same problem (technically, any case where foo::next calls range.front).

-Todd
Jun 13, 2014 at 2:34 PM
Edited Jun 13, 2014 at 2:35 PM
The following changes to where_range did the trick. Unfortunately, this requires where_range to store a copy of the value from range.front().
        template<typename TRange, typename TPredicate>
        struct where_range : base_range
        {
            ...
            detail::opt<value_type> front_value;

            CPPLINQ_INLINEMETHOD return_type front () const
            {
                return front_value.get();
            }

            CPPLINQ_INLINEMETHOD bool next ()
            {
                while (range.next ())
                {
                    front_value = to_opt(range.front());
                    if (predicate (front_value.get()))
                    {
                        return true;
                    }
                }

                return false;
            }
        };
I stored the value in a detail::opt so as to not require that value_type be default constructable.

-Todd
Jun 16, 2014 at 6:11 PM
Edited Jun 16, 2014 at 6:14 PM
Thinking about it some more, it seems that caching the return value should be in select, where it gets computed. This frees the other ranges that don't modify it from needing to cache it. I reverted my change to where_range and put the caching into select_range. Unfortunately, I am running into a MSVC bug when putting the cached value into a detail::opt. So, I have implemented it two ways, with preprocessor switches to control which gets compiled. The non-opt version requires that value_type be default constructable, and will default construct it when select_range is constructed.

select_range becomes:
        template<typename TRange, typename TPredicate>
        struct select_range : base_range
        {
            static typename TRange::value_type get_source ();
            static          TPredicate get_predicate ();


            typedef        decltype (get_predicate ()(get_source ()))   raw_value_type ;
            typedef        typename cleanup_type<raw_value_type>::type  value_type     ;
            typedef                 value_type                          return_type    ;
            enum
            {
                returns_reference   = 0   ,
            };

            typedef                 select_range<TRange, TPredicate>    this_type      ;
            typedef                 TRange                              range_type     ;
            typedef                 TPredicate                          predicate_type ;

            range_type              range       ;
            predicate_type          predicate   ;
#ifdef _MSC_VER
            mutable bool            needs_computed;
            mutable return_type     cached_value;
#else
            mutable detail::opt<value_type> cached_value;
#endif

            CPPLINQ_INLINEMETHOD select_range (
                    range_type      range
                ,   predicate_type  predicate
                ) CPPLINQ_NOEXCEPT
                :   range       (std::move (range))
                ,   predicate   (std::move (predicate))
#ifdef _MSC_VER
                ,   needs_computed (true)
#endif
            {
            }

            CPPLINQ_INLINEMETHOD select_range (select_range const & v)
                :   range       (v.range)
                ,   predicate   (v.predicate)
#ifdef _MSC_VER
                ,   needs_computed (true)
#endif
            {
            }

            CPPLINQ_INLINEMETHOD select_range (select_range && v) CPPLINQ_NOEXCEPT
                :   range       (std::move (v.range))
                ,   predicate   (std::move (v.predicate))
#ifdef _MSC_VER
                ,   needs_computed (true)
#endif
            {
            }

            template<typename TRangeBuilder>
            CPPLINQ_INLINEMETHOD typename get_builtup_type<TRangeBuilder, this_type>::type operator>>(TRangeBuilder range_builder) const
            {
                return range_builder.build (*this);
            }

            CPPLINQ_INLINEMETHOD return_type front () const
            {
#ifdef _MSC_VER
                if (needs_computed)
                {
                    cached_value = predicate (range.front ());
                    needs_computed = false;
                }
                return cached_value;
#else
                if (cached_value.has_value() == false)
                {
                    cached_value = predicate (range.front ());
                }
                return cached_value.get();
#endif
            }

            CPPLINQ_INLINEMETHOD bool next ()
            {
#ifdef _MSC_VER
                needs_computed = true;
#else
                cached_value.clear ();
#endif
                return range.next ();
            }
        };
I haven't compiled the non-MSVC version, so it is possible it has issues. I am getting my CentOS box upgraded to a recent version of GCC and will give it a shot once that happens.

-Todd
Coordinator
Jun 16, 2014 at 8:01 PM
Your analyzis seems correct.

I agree it the caching should most likely be in select. I do however feel a bit queasy about automatic caching in select, perhaps I am overly cautious.

Another idea is to introduce a new caching operator something like this
auto q = from (v) >> select (...) >> cache () >> where (...) >> to_vector ()
If you don't mind you can prepare a pull-request, I can test run it on my linux VM on other compilers

I don't think you need needs_computed, the opt type keeps track if it's a assigned a value: if (cached_value) {// enters if cached_value is assigned}
Coordinator
Jun 16, 2014 at 8:06 PM
Another small thing,

as you are caching the value in select now then return_type and returns_reference should be updated to a ref instead of value:
typedef                 value_type const &                return_type    ;

enum
{
    returns_reference   = 1  ,
};
That way at least we will avoid a copying the value and will only invoke the move .ctor when caching... with that change I feeling more relaxed over the change.

However, I will need to do some performance measurement as well.
Coordinator
Jun 16, 2014 at 8:06 PM
Note to self; Perhaps select_many needs caching too....
Jun 16, 2014 at 9:30 PM
I agree that it can return a reference. I didn't feel confident enough with the framework to make that change yet.

I understand how the opt type works. The needs_computed is in the #ifdef _MSC_VER block and is needed because MSVC wouldn't handle the opt type in this case. I implemented it with the opt first ,but MSVC didn't recognize the value_type when used in opt<value_type> and claimed the type was empty (not void, just... empty). So, I effectively implemented my own opt as a bool and a value_type.

I had thought that select_many may need caching too.
Coordinator
Jun 18, 2014 at 6:36 AM
My hope is that I have time to look at this issue over the weekend
Jun 18, 2014 at 9:54 PM
Caching in select makes select do the right thing, but perhaps not perform as well as it could. Not caching in select seems to be an optimization. Perhaps have the default select perform caching and have a select_ref that doesn't cache and returns by reference.

I have GCC 4.7.2 on my Linux box now and cpplinq compiles cleanly for me, including with the detail::opt as the cache in select.

I have created an ancillary hpp file that implements several additional higher-level-functions (to_set, to_multimap, a to_map that has no predicate and inserts pairs from the previous step, first_index, try_first and try_first_index) and several convenience predicates (non_null for use with where and as_static_type and as_dynamic_type for performing casting via select: >> select (as_static_type<int>()). I will make them available when I have performed more testing. There are several others that I intend to implement.

I have also fixed a bug in max. The std::numeric_limits<T>::min doesn't perform as expected when T = float or T = double. In these cases it returns the smallest positive value rather than the largest negative value. The fix involves defining a new templated class called linqlimits and defining a min operator for it. The unspecialized one calls numeric_limits::min. There are specialized versions for float and double that call -numeric_limits::max. I then use linqlimits::min rather than using numeric_limits::min to initialize the current value.
Coordinator
Jul 18, 2014 at 9:44 AM
The extra iterations (or predicate applications) should now be resolved with commit: 073a3877eb7c.

The numeric_limits::min issue should be resolved in commit: ec3c1bbb2f53

Other fixes we can do in separate commits.