SelectMany的原碼探險

2018 iT 邦幫忙鐵人賽

DAY 13

Software Development

深入探索LINQ系列第 13 篇

2018鐵人賽 linq .net core

Peter Chen

2018-01-01 14:42:27

3275 瀏覽

分享至

Select及SelectMany的差別在前一章的說明後應該有個初步的了解了，知道了應用的方式後我們接著來看看它是怎麼做到的吧。

原始碼分析

Source Code: https://github.com/dotnet/corefx/blob/master/src/System.Linq/src/System/Linq/SelectMany.cs
Public Method: SelectMany總共有四個多載的方法，兩個是只有一個selector，另外兩個是有兩個selector(collectionSelector及resultSelector)

/* 下面兩個方法只有一個selector */
public static IEnumerable<TResult> SelectMany<TSource, TResult>(    // 4
    this IEnumerable<TSource> source, 
    Func<TSource, IEnumerable<TResult>> selector);
    
public static IEnumerable<TResult> SelectMany<TSource, TResult>(    // 1
    this IEnumerable<TSource> source, 
    Func<TSource, int, IEnumerable<TResult>> selector); // 多一個int參數

/* 下面兩個方法有兩個selector */
public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(   // 2
    this IEnumerable<TSource> source, 
    Func<TSource, IEnumerable<TCollection>> collectionSelector, 
    Func<TSource, TCollection, TResult> resultSelector);
    
public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(   // 3
    this IEnumerable<TSource> source, 
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector, // 多一個int參數
    Func<TSource, TCollection, TResult> resultSelector);

方法的右邊註解的數字為等下講解的順序

第一個看的是只有一個selector但有int傳入參數的方法:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source, Func<TSource, int, IEnumerable<TResult>> selector)
{
    if (source == null)
    {
        throw Error.ArgumentNull(nameof(source));
    }

    if (selector == null)
    {
        throw Error.ArgumentNull(nameof(selector));
    }

    return SelectManyIterator(source, selector);
}

判斷source及selector是否為空，為空的話丟ArgumentNull例外
傳回SelectManyIterator

到這裡跟Select幾乎一模一樣，唯一有差別的就只有最後的回傳值，SelectMany是傳回SelectManyIterator，相信特別之處就是在這裡，我們來看看SelectMany的定義:

private static IEnumerable<TResult> SelectManyIterator<TSource, TResult>(
    IEnumerable<TSource> source, Func<TSource, int, IEnumerable<TResult>> selector)
{
    int index = -1;
    foreach (TSource element in source)
    {
        checked
        {
            index++;
        }

        foreach (TResult subElement in selector(element, index))
        {
            yield return subElement;
        }
    }
}

此方法區塊為yield區塊，會轉為Iterator Pattern，回傳的資料是IEnumerable的集合
yield return傳回每一個元素的資料
每個元素的index較前面的元素多加1
selector執行後取得每個元素的子集合資料，再用foreach巡覽整個子集合
傳回子集合的每個元素

我們可以看到跟SelectIterator還是幾乎一模一樣，差別在於第二個foreach，還記得我們前面講SelectMany的應用時比較了跟Select的差別之處就是SelectMany不用多一個迴圈去處理子集合的資料，從原始碼中觀察就更加明顯了，原來SelectMany已經幫我們把第二個迴圈要做的事情給做掉了。

接著我們要來看有兩個selector(collectionSelector及resultSelector)但CollectionSelector沒有int參數的方法:

public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
    this IEnumerable<TSource> source, 
    Func<TSource, IEnumerable<TCollection>> collectionSelector, 
    Func<TSource, TCollection, TResult> resultSelector)
{
    if (source == null)
    {
        throw Error.ArgumentNull(nameof(source));
    }

    if (collectionSelector == null)
    {
        throw Error.ArgumentNull(nameof(collectionSelector));
    }

    if (resultSelector == null)
    {
        throw Error.ArgumentNull(nameof(resultSelector));
    }

    return SelectManyIterator(source, collectionSelector, resultSelector);
}

判斷source及selector是否為空，空的話丟出ArgumentNull的例外
傳回SelectManyIterator

這個方法跟剛剛介紹的第一個SelectMany的方法差在多了一個if判斷resultSelector是否為空，然後回傳的方法SelectManyIterator有三個參數，想當然，這裡不會是重點所在，我們接著來看看這個有三個參數的SelectManyIterator:

private static IEnumerable<TResult> SelectManyIterator<TSource, TCollection, TResult>(
    IEnumerable<TSource> source, 
    Func<TSource, IEnumerable<TCollection>> collectionSelector, 
    Func<TSource, TCollection, TResult> resultSelector)
{
    foreach (TSource element in source)
    {
        foreach (TCollection subElement in collectionSelector(element))
        {
            yield return resultSelector(element, subElement);
        }
    }
}

回傳值為resultSelector執行後的結果

跟第一個介紹的方法差別只差在子集合的元素要回傳前再去執行了resultSelector，這樣的目的就是可以輸出子集合跟原集合合併的資料。

我們可以看到它跟第一個方法的結構是完全一樣的，但是有resultSelector的幫助讓我們可以更省力的拿到自己想要的檔案。

第三個方法是有int的collectionSelector的方法，因為方法的實作跟第二個完全一樣，我們就直接來看SelectManyIterator的實作:

private static IEnumerable<TResult> SelectManyIterator<TSource, TCollection, TResult>(
    IEnumerable<TSource> source, 
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector, 
    Func<TSource, TCollection, TResult> resultSelector)
{
    int index = -1;
    foreach (TSource element in source)
    {
        checked
        {
            index++;
        }

        foreach (TCollection subElement in collectionSelector(element, index))
        {
            yield return resultSelector(element, subElement);
        }
    }
}

學一套就會了全部，這個方法中完全沒有新的東西，只是把第一跟第二個方法合併起來而已。

用index來給予每個selector位置的資訊(第一個方法)
回傳resultSelector執行結果(第二個方法)

接著我們要來看最後一個方法了:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, IEnumerable<TResult>> selector)
{
    if (source == null)
    {
        throw Error.ArgumentNull(nameof(source));
    }

    if (selector == null)
    {
        throw Error.ArgumentNull(nameof(selector));
    }

    return new SelectManySingleSelectorIterator<TSource, TResult>(source, selector);
}

跟前面的方法一樣，判斷參數是否為空，如果都是合法的就傳給Iterator做事，眼尖的人應該有發現到這次叫用的Iterator跟前面方法叫用的並不相同，看來是有什麼秘密藏在這裡喔，我們來看看吧:

private sealed class SelectManySingleSelectorIterator<TSource, TResult> : 
    Iterator<TResult>, 
    IIListProvider<TResult>

這是一個實作了Iterator的Class，看到Iterator的Class自然就會想看看它的MoveNext，以下是它的實作:

public override bool MoveNext()
{
    switch (_state)
    {
        case 1:
            // Retrieve the source enumerator.
            _sourceEnumerator = _source.GetEnumerator();
            _state = 2;
            goto case 2;
        case 2:
            // Take the next element from the source enumerator.
            if (!_sourceEnumerator.MoveNext())
            {
                break;
            }

            TSource element = _sourceEnumerator.Current;

            // Project it into a sub-collection and get its enumerator.
            _subEnumerator = _selector(element).GetEnumerator();
            _state = 3;
            goto case 3;
        case 3:
            // Take the next element from the sub-collection and yield.
            if (!_subEnumerator.MoveNext())
            {
                _subEnumerator.Dispose();
                _subEnumerator = null;
                _state = 2;
                goto case 2;
            }

            _current = _subEnumerator.Current;
            return true;
    }

    Dispose();
    return false;
}

_state在GetEnumerator()時會設為1(請參考第11章-Select的原碼探險)
_state為1時取得集合的Enumerator
_state為2時對集合執行_selector取得目標資料，到這裡為止就是Select的MoveNext()所做的事，但SelectMany將目標資料的Enumerator取得放進_subEnumerator並且進入第三狀態(_state=3)
_state為3時將子集合的Enumerator(_subEnumerator)做巡覽放進_current裡面，如果巡覽終止則將狀態調回2(_state=2)

實際上SelectMany比Select多了一層的MoveNext()來取得子集合的元素資料，達到扁平化的目的。

測試案例分析

Source Code: https://github.com/dotnet/corefx/blob/master/src/System.Linq/tests/SelectManyTests.cs

ParameterizedTests

[Theory]
[MemberData(nameof(ParameterizedTestsData))]
public void ParameterizedTests(IEnumerable<int> source, Func<int, IEnumerable<int>>selector)
{
    var expected = source.Select(i => selector(i)).Aggregate((l, r) => l.Concat(r));
    var actual = source.SelectMany(selector);

    Assert.Equal(expected, actual);
    Assert.Equal(expected.Count(), actual.Count()); // SelectMany may employ an optimized Count implementation.
    Assert.Equal(expected.ToArray(), actual.ToArray());
    Assert.Equal(expected.ToList(), actual.ToList());
}

public static IEnumerable<object[]> ParameterizedTestsData()
{
    for (int i = 1; i <= 20; i++)
    {
        Func<int, IEnumerable<int>> selector = n => Enumerable.Range(i, n);
        yield return new object[] { Enumerable.Range(1, i), selector };
    }
}

Aggregate()會將每個目前巡覽的結果向後一個元素丟，以上述程式碼為例Aggregate((l, r) => l.Concat(r)):

l: 前個元素執行Aggregate後的值
r: 目前的元素值

所以Aggregate((l, r) => l.Concat(r))是把所有元素合為一個IEnumerable，而在這個測試案例我們可以發現到SelectMany()可以轉為Select().Aggregate((l, r) => l.Concat(r))。

DisposeAfterEnumeration

這是一個驗證Dispose執行的測試，在剛剛觀察程式碼後我們知道在巡覽(MoveNext())過程中會產生兩層Enumerator(Source及Sub)，這個測試就是要確定Enumerator都有在應該Dispose時Dispose，由於案例較長，我們節錄重要的部分:

using (e)   // Enumerator
{
    while (e.MoveNext())
    {
        int item = e.Current;

        Assert.Equal(subState[subIndex], item); // Verify Current.
        Assert.Equal(index / subLength, subIndex);

        // 第一層的Source巡覽結束後才會Dispose
        Assert.False(sourceDisposed); // Not yet.

        // This represents whehter the sub-collection we're iterating thru right now
        // has been disposed. Also not yet.
        Assert.False(subCollectionDisposed[subIndex]);  // 目前的Sub因還在執行巡覽所以不會Dispose

        // However, all of the sub-collections before us should have been disposed.
        // Their indices should also be maxed out.
        Assert.All(subState.Take(subIndex), s => Assert.Equal(subLength + 1, s));

        // 此Source中的其他已巡覽完的Sub會Dispose
        Assert.All(subCollectionDisposed.Take(subIndex), t => Assert.True(t));        

        index++;
    }
}

// 巡覽結束Source會Dispose
Assert.True(sourceDisposed);
Assert.Equal(sourceLength, subIndex);
Assert.All(subState, s => Assert.Equal(subLength + 1, s));
Assert.All(subCollectionDisposed, t => Assert.True(t));

Source在全部巡覽完後Dispose
之前的Sub都應該Dispose

CollectionInterleavedWithLazyEnumerables_ToArray

我們都知道IEnumerable只會知道目前的元素資料，它是屬於一種延遲執行的巡覽方式，而陣列跟Enumerable的狀況不同，它一開始就知道所有的元素值了，那如果把它們兩個同時放到一個IEnumerable要如何處理呢? 例如像下面這樣:

// Marker at the end
new IEnumerable<int>[]
{
    new TestEnumerable<int>(new int[] { 0 }),
    new TestEnumerable<int>(new int[] { 1 }),
    new TestEnumerable<int>(new int[] { 2 }),
    new int[] { 3 },
}

當然我們還是可以把陣列當作IEnumerable去做處理，這在一般的處理中是可以的(因為都必須要Call MoveNext())，但在ToArray()中你的目標本來就是要轉為Array了，你卻要把陣列轉成IEnumerable再轉成Array怎麼樣都划不來，因此SelectMany的ToArray()用了一個Marker來表示集合中的陣列，我們先來看程式碼:

public TResult[] ToArray()
{
    var builder = new SparseArrayBuilder<TResult>(initialize: true);
    var deferredCopies = new ArrayBuilder<IEnumerable<TResult>>();

    foreach (TSource element in _source)
    {
        IEnumerable<TResult> enumerable = _selector(element);

        /* 
         * 是陣列(或是非延遲的集合)嗎?
         * Yes: 將其的位置(Index)及數量(Count)加到builder.Markers中，然後傳回true
         * No: 加到builder中，然後回傳false
         */
        if (builder.ReserveOrAdd(enumerable))
        {
            // 陣列內容加到deferredCopies中
            deferredCopies.Add(enumerable);
        }
    }

    // 將builder中的資料做ToArray的動作(因為已經排除了陣列的資料，所以沒有做多餘的轉換)
    TResult[] array = builder.ToArray();

    ArrayBuilder<Marker> markers = builder.Markers; // 取得陣列位置(Index)及數量(Count)資訊
    for (int i = 0; i < markers.Count; i++)
    {
        Marker marker = markers[i];
        IEnumerable<TResult> enumerable = deferredCopies[i];    // 取得陣列內容
        EnumerableHelpers.Copy(enumerable, array, marker.Index, marker.Count);  // 複製到剛剛builder轉出來的陣列中
    }

    return array;
}

從上述的程式碼說明可以知道ToArray()的運作細節，這個測試案例就是在測這個部分，我是因為看到這個案例才知道它的實作方式這麼的特別，所以才在測試案例這個單元做說明，如果想要深入了解的可以仔細看看這個測試案例。